@yeyuan98/opencode-bioresearcher-plugin 1.5.2 → 1.5.3

# Python Code Standards (DRY Principle)

Comprehensive documentation requirements and DRY (Don't Repeat Yourself) principle enforcement for all Python scripts.

## Overview

This pattern defines mandatory standards for Python code written by the bioresearcher agent:
- Complete documentation (module and function docstrings)
- DRY principle enforcement (no code duplication)
- Type hints and validation
- Error handling best practices
- Reusable component design

---

## 1. Module-Level Documentation

### Required Elements

**EVERY Python script MUST include:**

```python
#!/usr/bin/env python3
"""[Script Purpose - One Line Description]

This module provides functionality for:
- [Functionality 1 - brief description]
- [Functionality 2 - brief description]
- [Functionality 3 - brief description]

Usage:
    uv run python script_name.py [command] [options]

Examples:
    uv run python script_name.py process --input data.xlsx --output results.json
    uv run python script_name.py validate --schema schema.json
    uv run python script_name.py analyze --config config.yaml

Dependencies:
    - pandas >= 1.5.0
    - openpyxl >= 3.0.0
    - [Other dependencies with version requirements]

Configuration:
    [Any configuration files or environment variables needed]

Output:
    [Description of output format and location]

Author: BioResearcher AI Agent
Date: YYYY-MM-DD
Version: 1.0.0
"""
```

### Example: Well-Documented Module

```python
#!/usr/bin/env python3
"""Clinical Trial Response Rate Analysis

This module provides functionality for:
- Loading and validating clinical trial data from Excel/CSV files
- Calculating response rates across trial phases and conditions
- Performing statistical comparisons (chi-square, t-test)
- Generating summary reports with visualizations

Usage:
    uv run python trial_analysis.py analyze --input trials.xlsx --output results/
    uv run python trial_analysis.py compare --phase1 "Phase 2" --phase2 "Phase 3"
    uv run python trial_analysis.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0

Configuration:
    Requires env.jsonc with database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""
```

---

## 2. Function Documentation

### Standard Docstring Format

**Required sections:**
1. Brief description (one line)
2. Extended description (if needed)
3. Args (with types and descriptions)
4. Returns (with type and description)
5. Raises (if applicable)
6. Example (recommended)

### Template

```python
def function_name(
    param1: Type1,
    param2: Type2,
    optional_param: Type3 = default_value
) -> ReturnType:
    """Brief one-line description of function.

    Extended description explaining what the function does,
    why it's needed, and any important behavior notes.

    Args:
        param1: Description of param1
        param2: Description of param2
            Can span multiple lines for complex parameters
        optional_param: Description of optional parameter
            Default: default_value

    Returns:
        Description of return value
        For complex returns, describe structure:
        {
            "key1": type - description,
            "key2": type - description
        }

    Raises:
        ExceptionType: When/why this exception is raised
        AnotherException: Description

    Example:
        >>> result = function_name(param1="value", param2=42)
        >>> print(result["key"])
        expected_output
    """
    # Implementation
    pass
```

### Example: Well-Documented Function

```python
def analyze_clinical_trials(
    trials_data: List[Dict[str, Any]],
    filter_criteria: Dict[str, str],
    output_format: str = "json"
) -> Dict[str, Any]:
    """Analyze clinical trial data with specified filters.

    Performs comprehensive analysis of clinical trial records including:
    - Filtering by phase, status, and condition
    - Statistical aggregation by sponsor and location
    - Response rate comparison across trial phases

    Args:
        trials_data: List of trial dictionaries from dbQuery or API
            Each dict must contain: nct_id, phase, status, condition,
            sponsor, response_rate, patient_count
        filter_criteria: Dictionary of filter conditions
            Example: {"phase": "Phase 3", "status": "Recruiting"}
            Supported keys: phase, status, condition, sponsor
        output_format: Output format for results
            Options: "json", "dataframe", "dict"
            Default: "json"

    Returns:
        Dictionary containing analysis results:
        {
            "filtered_count": int - Number of trials after filtering,
            "statistics": {
                "avg_response_rate": float,
                "total_patients": int,
                "by_phase": Dict[str, Dict]
            },
            "trials": List[Dict] - Filtered trial records
        }

    Raises:
        ValueError: If filter_criteria contains unsupported keys
        KeyError: If trials_data missing required fields
        TypeError: If trials_data is not a list

    Example:
        >>> trials = dbQuery("SELECT * FROM clinical_trials")
        >>> results = analyze_clinical_trials(
        ...     trials_data=trials,
        ...     filter_criteria={"phase": "Phase 3"},
        ...     output_format="json"
        ... )
        >>> print(results["filtered_count"])
        42
        >>> print(results["statistics"]["avg_response_rate"])
        0.65
    """
    # Validate inputs
    _validate_filter_criteria(filter_criteria)
    validated_data = _validate_trials(trials_data)

    # Apply filters (reusable function)
    filtered = _apply_filters(validated_data, filter_criteria)

    # Calculate statistics (reusable function)
    stats = _calculate_statistics(filtered)

    # Format output: "json" and "dict" share the same structure,
    # so only "dataframe" needs special handling (DRY)
    result = {
        "filtered_count": len(filtered),
        "statistics": stats,
        "trials": filtered
    }
    if output_format == "dataframe":
        import pandas as pd
        result["trials"] = pd.DataFrame(filtered)
    return result
```

---

## 3. DRY Principle Enforcement

### Core Principles

1. **Single Responsibility** - Each function does ONE thing
2. **No Code Duplication** - If code appears twice, extract to function
3. **Reusable Components** - Design functions for reuse across scripts
4. **Configuration Over Hardcoding** - Use parameters, not hardcoded values

### Violation Examples and Corrections

#### Violation 1: Code Duplication

```python
# ❌ BAD: Duplicated validation logic
def analyze_trials(trials):
    """Analyze trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... analysis logic

def export_trials(trials):
    """Export trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... export logic
```

```python
# ✅ GOOD: DRY principle - reusable validation function
def _validate_trial(trial: Dict[str, Any]) -> bool:
    """Validate single trial record.

    Checks for required fields and data types.
    Reusable across multiple functions.

    Args:
        trial: Trial dictionary to validate

    Returns:
        True if valid

    Raises:
        ValueError: If required fields missing or invalid
    """
    required_fields = {
        "nct_id": str,
        "phase": str,
        "status": str,
        "response_rate": (int, float),
        "patient_count": int
    }

    for field, expected_type in required_fields.items():
        # Membership test, not trial.get(field): valid falsy values
        # (e.g. a response_rate of 0) must not be rejected
        if field not in trial:
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(trial[field], expected_type):
            raise ValueError(
                f"Field {field} has wrong type: "
                f"expected {expected_type}, got {type(trial[field])}"
            )

    return True

def analyze_trials(trials: List[Dict]) -> Dict:
    """Analyze trials with reusable validation."""
    for trial in trials:
        _validate_trial(trial)
    # ... analysis logic

def export_trials(trials: List[Dict], output_path: str) -> None:
    """Export trials with reusable validation."""
    for trial in trials:
        _validate_trial(trial)
    # ... export logic
```

#### Violation 2: Hardcoded Values

```python
# ❌ BAD: Hardcoded configuration
def analyze_data(data):
    """Analyze with hardcoded thresholds."""
    if len(data) > 1000:
        print("Large dataset detected")

    for item in data:
        if item["value"] > 0.5:  # Magic number
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

```python
# ✅ GOOD: Configuration via parameters
def analyze_data(
    data: List[Dict],
    large_dataset_threshold: int = 1000,
    category_threshold: float = 0.5
) -> List[Dict]:
    """Analyze with configurable thresholds.

    Args:
        data: List of data items
        large_dataset_threshold: Threshold for large dataset warning
            Default: 1000
        category_threshold: Threshold for High/Low categorization
            Default: 0.5

    Returns:
        Data with categories assigned
    """
    if len(data) > large_dataset_threshold:
        print(f"Large dataset detected: {len(data)} items")

    for item in data:
        if item["value"] > category_threshold:
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

#### Violation 3: Multiple Responsibilities

```python
# ❌ BAD: Function does too many things
def process_trials(file_path):
    """Load, validate, filter, analyze, and export trials."""
    # Load data
    df = pd.read_excel(file_path)

    # Validate
    if df.empty:
        raise ValueError("Empty file")

    # Filter
    df = df[df["phase"] == "Phase 3"]

    # Analyze
    stats = df.groupby("condition")["response_rate"].mean()

    # Export
    stats.to_excel("output.xlsx")

    # Visualize
    stats.plot(kind="bar")
    plt.savefig("plot.png")

    return stats
```

```python
# ✅ GOOD: Single responsibility per function
def load_trials(file_path: str) -> pd.DataFrame:
    """Load trials from Excel file."""
    return pd.read_excel(file_path)

def validate_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Validate trial data."""
    if df.empty:
        raise ValueError("Empty file")
    required_cols = ["nct_id", "phase", "response_rate"]
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def filter_trials(
    df: pd.DataFrame,
    phase: Optional[str] = None,
    status: Optional[str] = None
) -> pd.DataFrame:
    """Filter trials by criteria."""
    if phase:
        df = df[df["phase"] == phase]
    if status:
        df = df[df["status"] == status]
    return df

def analyze_trials(df: pd.DataFrame) -> pd.Series:
    """Calculate statistics by condition."""
    return df.groupby("condition")["response_rate"].mean()

def export_results(stats: pd.Series, output_path: str) -> None:
    """Export statistics to Excel."""
    stats.to_excel(output_path)

def create_visualization(stats: pd.Series, output_path: str) -> None:
    """Create bar chart visualization."""
    import matplotlib.pyplot as plt
    stats.plot(kind="bar")
    plt.savefig(output_path)

# Main workflow
def process_trials(file_path: str, output_dir: str) -> pd.Series:
    """Process trials using modular functions."""
    df = load_trials(file_path)
    df = validate_trials(df)
    df = filter_trials(df, phase="Phase 3")
    stats = analyze_trials(df)
    export_results(stats, f"{output_dir}/stats.xlsx")
    create_visualization(stats, f"{output_dir}/plot.png")
    return stats
```

---

## 4. Type Hints

### Why Type Hints Matter

- Improve code readability
- Enable IDE autocomplete and type checking
- Document expected types
- Catch errors early

### Standard Type Hint Patterns

```python
from typing import Dict, List, Any, Optional, Union, Tuple

# Basic types
def process_string(text: str) -> str:
    pass

# Collections
def process_list(items: List[str]) -> List[int]:
    pass

def process_dict(data: Dict[str, Any]) -> Dict[str, float]:
    pass

# Optional parameters
def search(
    query: str,
    limit: Optional[int] = None
) -> List[Dict]:
    pass

# Union types
def parse_value(value: Union[str, int, float]) -> float:
    pass

# Tuple returns
def get_stats(data: List[float]) -> Tuple[float, float, float]:
    """Returns (mean, std, median)."""
    pass

# Complex structures
def analyze_trials(
    trials: List[Dict[str, Union[str, int, float]]]
) -> Dict[str, Any]:
    pass
```
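Because annotations are stored on the function object itself, they are machine-readable at runtime as well as statically - this is what IDE autocomplete and type checkers build on. A small sketch (the `search` signature is reused from the patterns above purely for illustration):

```python
from typing import Dict, List, Optional, get_type_hints

def search(query: str, limit: Optional[int] = None) -> List[Dict]:
    """Toy function used only to demonstrate annotation introspection."""
    return []

# get_type_hints resolves the string/lazy annotations into real type objects
hints = get_type_hints(search)
print(hints["query"])                    # <class 'str'>
print(hints["limit"] == Optional[int])   # True
```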

---

## 5. Error Handling

### Validation Pattern

```python
def analyze_data(data: List[Dict], threshold: float) -> Dict:
    """Analyze data with comprehensive validation.

    Args:
        data: List of data items
        threshold: Analysis threshold (0.0 to 1.0)

    Returns:
        Analysis results

    Raises:
        TypeError: If data is not a list
        ValueError: If threshold out of range or data empty
    """
    # Type validation
    if not isinstance(data, list):
        raise TypeError(f"data must be list, got {type(data)}")

    if not isinstance(threshold, (int, float)):
        raise TypeError(f"threshold must be numeric, got {type(threshold)}")

    # Value validation
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")

    if len(data) == 0:
        raise ValueError("data cannot be empty")

    # Field validation
    for i, item in enumerate(data):
        if "value" not in item:
            raise ValueError(f"Item {i} missing 'value' field")

    # Analysis logic
    results = [item for item in data if item["value"] > threshold]

    return {
        "count": len(results),
        "threshold": threshold,
        "items": results
    }
```

### Exception Handling Pattern

```python
def load_and_process(file_path: str) -> Dict:
    """Load and process file with error handling.

    Args:
        file_path: Path to input file

    Returns:
        Processed data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file format invalid
    """
    import json

    # Try-except for external operations
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {file_path}: {e}") from e

    # Validate structure
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")

    # Process with error handling, preserving the original traceback
    try:
        processed = _process_data(data)
    except Exception as e:
        raise RuntimeError(f"Processing failed: {e}") from e

    return processed
```

---

## 6. File Organization

### Directory Structure

```
.scripts/py/
├── [topic]_analysis.py    # Main analysis script
├── [topic]_utils.py       # Reusable utilities
├── [topic]_config.py      # Configuration constants
└── requirements.txt       # Dependencies (if needed)
```

### Script Structure

```python
#!/usr/bin/env python3
"""Module docstring"""

# 1. Imports
import sys
from typing import Dict, List, Any
import pandas as pd

# 2. Constants/Configuration
DEFAULT_THRESHOLD = 0.5
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# 3. Utility functions (private, prefix with _)
def _validate_input(data: Any) -> bool:
    """Private utility function."""
    pass

def _format_output(results: Dict) -> str:
    """Private utility function."""
    pass

# 4. Main functions (public)
def analyze_data(data: List[Dict]) -> Dict:
    """Public main function."""
    pass

def export_results(results: Dict, output_path: str) -> None:
    """Public utility function."""
    pass

# 5. CLI interface (if applicable)
def main():
    """Command-line interface."""
    import argparse

    parser = argparse.ArgumentParser(description="...")
    parser.add_argument("command", choices=["analyze", "export"])
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)

    args = parser.parse_args()

    # CLI logic
    if args.command == "analyze":
        data = load_data(args.input)
        results = analyze_data(data)
        export_results(results, args.output)

# 6. Main execution block
if __name__ == "__main__":
    main()
```

---

## 7. Code Quality Checklist

Before finalizing any Python script, verify:

### Documentation
- [ ] Module docstring complete (purpose, usage, dependencies)
- [ ] All functions have docstrings
- [ ] Args section complete with types
- [ ] Returns section complete with types
- [ ] Raises section for functions that can fail
- [ ] Examples included for complex functions

### DRY Principle
- [ ] No duplicated code blocks
- [ ] Repeated logic extracted to functions
- [ ] Single responsibility per function
- [ ] Configuration via parameters, not hardcoded values
- [ ] Utility functions reusable across scripts

### Type Hints
- [ ] All function parameters have type hints
- [ ] All return types specified
- [ ] Complex types imported from typing module
- [ ] Optional parameters marked as Optional[Type]

### Error Handling
- [ ] Input validation at function start
- [ ] Type checking for critical parameters
- [ ] Try-except for external operations
- [ ] Informative error messages
- [ ] Proper exception types

### Code Style
- [ ] Consistent naming (snake_case for functions/variables)
- [ ] Functions are < 50 lines (if longer, split)
- [ ] Logical grouping of related functions
- [ ] Private functions prefixed with _
- [ ] No unused imports or variables
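Several of these items can be spot-checked automatically. As an illustration only (not a prescribed tool), a short script using Python's built-in `ast` module can flag functions that lack a docstring or return annotation, or exceed the 50-line limit:

```python
import ast
from typing import List

def check_functions(source: str) -> List[str]:
    """Return checklist warnings for each function in the given source code."""
    warnings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            if ast.get_docstring(node) is None:
                warnings.append(f"{node.name}: missing docstring")
            if node.returns is None:
                warnings.append(f"{node.name}: missing return type hint")
            length = node.end_lineno - node.lineno + 1  # requires Python >= 3.8
            if length > 50:
                warnings.append(f"{node.name}: {length} lines (> 50, consider splitting)")
    return warnings

sample = '''
def documented(x: int) -> int:
    """Doubles x."""
    return 2 * x

def undocumented(x):
    return x
'''
for warning in check_functions(sample):
    print(warning)  # flags only the undocumented function
```

Feeding it a script's own source text gives a quick pass over the Documentation and Code Style sections of the checklist; DRY violations still need human review.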

---

## 8. Example: Complete Well-Structured Script

```python
#!/usr/bin/env python3
"""Clinical Trial Statistical Analysis

This module provides functionality for:
- Loading trial data from multiple sources (Excel, CSV, database)
- Calculating response rates and survival statistics
- Performing statistical comparisons (chi-square, t-test, log-rank)
- Generating comprehensive reports with visualizations

Usage:
    uv run python trial_stats.py analyze --input trials.xlsx --output results/
    uv run python trial_stats.py compare --group1 Phase2 --group2 Phase3
    uv run python trial_stats.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0
    - lifelines >= 0.27.0

Configuration:
    env.jsonc for database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - PNG files with visualizations
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""

from typing import Dict, List, Any, Optional, Tuple
import pandas as pd
import numpy as np
from scipy import stats

# Constants
DEFAULT_ALPHA = 0.05
MIN_SAMPLE_SIZE = 10
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# Private utility functions

def _validate_dataframe(df: pd.DataFrame, required_cols: List[str]) -> None:
    """Validate DataFrame has required columns.

    Args:
        df: DataFrame to validate
        required_cols: List of required column names

    Raises:
        ValueError: If required columns missing
    """
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

def _calculate_confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95
) -> Tuple[float, float]:
    """Calculate confidence interval for data.

    Args:
        data: Array of values
        confidence: Confidence level (0.0 to 1.0)

    Returns:
        Tuple of (lower_bound, upper_bound)
    """
    mean = np.mean(data)
    sem = stats.sem(data)
    interval = sem * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
    return (mean - interval, mean + interval)

# Main analysis functions

def load_trial_data(
    file_path: str,
    file_format: str = "excel"
) -> pd.DataFrame:
    """Load trial data from file.

    Args:
        file_path: Path to input file
        file_format: File format ("excel", "csv", "json")

    Returns:
        DataFrame with trial data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If format unsupported or file invalid
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {file_format}")

    try:
        if file_format == "excel":
            df = pd.read_excel(file_path)
        elif file_format == "csv":
            df = pd.read_csv(file_path)
        else:  # json
            df = pd.read_json(file_path)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except Exception as e:
        raise ValueError(f"Failed to load {file_path}: {e}") from e

    # Validate structure
    required = ["nct_id", "phase", "response_rate"]
    _validate_dataframe(df, required)

    return df

def compare_response_rates(
    df: pd.DataFrame,
    group_column: str = "phase",
    alpha: float = DEFAULT_ALPHA
) -> Dict[str, Any]:
    """Compare response rates across groups with statistical testing.

    Performs chi-square test for independence and calculates
    confidence intervals for each group.

    Args:
        df: DataFrame containing trial data
        group_column: Column to group by
        alpha: Significance level (0.0 to 1.0)

    Returns:
        Dictionary containing:
        {
            "group_stats": Statistics per group,
            "chi_square": Chi-square test results,
            "significant": Boolean for significance,
            "confidence_intervals": CI per group
        }

    Raises:
        ValueError: If insufficient data or invalid alpha
    """
    if not 0 < alpha < 1:
        raise ValueError(f"alpha must be 0-1, got {alpha}")

    if len(df) < MIN_SAMPLE_SIZE:
        raise ValueError(f"Insufficient data: {len(df)} < {MIN_SAMPLE_SIZE}")

    # Group statistics
    group_stats = df.groupby(group_column)["response_rate"].agg(
        ["mean", "std", "count", "median"]
    ).to_dict()

    # Chi-square test
    contingency = pd.crosstab(
        df[group_column],
        (df["response_rate"] > df["response_rate"].median()).astype(int)
    )
    chi2, p_value, dof, _ = stats.chi2_contingency(contingency)

    # Confidence intervals (cast to plain floats so results stay JSON-serializable)
    ci_results = {}
    for group in df[group_column].unique():
        group_data = df[df[group_column] == group]["response_rate"].values
        ci_low, ci_high = _calculate_confidence_interval(group_data)
        ci_results[group] = {"lower": float(ci_low), "upper": float(ci_high)}

    return {
        "group_stats": group_stats,
        "chi_square": {
            "statistic": float(chi2),
            "p_value": float(p_value),
            "degrees_of_freedom": int(dof)
        },
        "significant": bool(p_value < alpha),
        "confidence_intervals": ci_results
    }

# CLI interface

def main():
    """Command-line interface for trial statistics."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Clinical trial statistical analysis"
    )
    parser.add_argument("command", choices=["analyze", "compare"])
    parser.add_argument("--input", required=True, help="Input file path")
    parser.add_argument("--output", required=True, help="Output directory")
    parser.add_argument(
        "--format",
        choices=SUPPORTED_FORMATS,
        default="excel",
        help="Input file format"
    )

    args = parser.parse_args()

    # Load data
    df = load_trial_data(args.input, args.format)

    # Execute command
    if args.command == "compare":
        results = compare_response_rates(df)
        # Export results; default=str covers remaining numpy types in group_stats
        import json
        with open(f"{args.output}/comparison.json", 'w') as f:
            json.dump(results, f, indent=2, default=str)
        print(f"Results saved to {args.output}/comparison.json")

if __name__ == "__main__":
    main()
```

---

## Summary

**Every Python script must have:**
1. ✅ Complete module docstring
2. ✅ Function docstrings with Args/Returns/Raises/Example
3. ✅ No code duplication (DRY principle)
4. ✅ Type hints for all functions
5. ✅ Input validation and error handling
6. ✅ Single responsibility per function
7. ✅ Configuration via parameters
8. ✅ Proper file organization

**Follow this pattern for all Python code to ensure maintainability, reusability, and professional quality.**
1
+ # Python Code Standards (DRY Principle)
2
+
3
+ Comprehensive documentation requirements and DRY (Don't Repeat Yourself) principle enforcement for all Python scripts.
4
+
5
+ ## Overview
6
+
7
+ This pattern defines mandatory standards for Python code written by the bioresearcher agent:
8
+ - Complete documentation (module and function docstrings)
9
+ - DRY principle enforcement (no code duplication)
10
+ - Type hints and validation
11
+ - Error handling best practices
12
+ - Reusable component design
13
+
14
+ ---
15
+
16
+ ## 1. Module-Level Documentation
17
+
18
+ ### Required Elements
19
+
20
+ **EVERY Python script MUST include:**
21
+
22
+ ```python
23
+ #!/usr/bin/env python3
24
+ """[Script Purpose - One Line Description]
25
+
26
+ This module provides functionality for:
27
+ - [Functionality 1 - brief description]
28
+ - [Functionality 2 - brief description]
29
+ - [Functionality 3 - brief description]
30
+
31
+ Usage:
32
+ uv run python script_name.py [command] [options]
33
+
34
+ Examples:
35
+ uv run python script_name.py process --input data.xlsx --output results.json
36
+ uv run python script_name.py validate --schema schema.json
37
+ uv run python script_name.py analyze --config config.yaml
38
+
39
+ Dependencies:
40
+ - pandas >= 1.5.0
41
+ - openpyxl >= 3.0.0
42
+ - [Other dependencies with version requirements]
43
+
44
+ Configuration:
45
+ [Any configuration files or environment variables needed]
46
+
47
+ Output:
48
+ [Description of output format and location]
49
+
50
+ Author: BioResearcher AI Agent
51
+ Date: YYYY-MM-DD
52
+ Version: 1.0.0
53
+ """
54
+ ```
55
+
56
+ ### Example: Well-Documented Module
57
+
58
+ ```python
59
+ #!/usr/bin/env python3
60
+ """Clinical Trial Response Rate Analysis
61
+
62
+ This module provides functionality for:
63
+ - Loading and validating clinical trial data from Excel/CSV files
64
+ - Calculating response rates across trial phases and conditions
65
+ - Performing statistical comparisons (chi-square, t-test)
66
+ - Generating summary reports with visualizations
67
+
68
+ Usage:
69
+ uv run python trial_analysis.py analyze --input trials.xlsx --output results/
70
+ uv run python trial_analysis.py compare --phase1 "Phase 2" --phase2 "Phase 3"
71
+ uv run python trial_analysis.py report --results results/ --output report.md
72
+
73
+ Dependencies:
74
+ - pandas >= 1.5.0
75
+ - scipy >= 1.9.0
76
+ - openpyxl >= 3.0.0
77
+ - matplotlib >= 3.6.0
78
+
79
+ Configuration:
80
+ Requires env.jsonc with database connection (optional)
81
+
82
+ Output:
83
+ - JSON files with statistical results
84
+ - Excel files with aggregated data
85
+ - Markdown report with findings
86
+
87
+ Author: BioResearcher AI Agent
88
+ Date: 2024-01-15
89
+ Version: 1.0.0
90
+ """
91
+ ```
92
+
93
+ ---
94
+
95
+ ## 2. Function Documentation
96
+
97
+ ### Standard Docstring Format
98
+
99
+ **Required sections:**
100
+ 1. Brief description (one line)
101
+ 2. Extended description (if needed)
102
+ 3. Args (with types and descriptions)
103
+ 4. Returns (with type and description)
104
+ 5. Raises (if applicable)
105
+ 6. Example (recommended)
106
+
107
+ ### Template
108
+
109
+ ```python
110
+ def function_name(
111
+ param1: Type1,
112
+ param2: Type2,
113
+ optional_param: Type3 = default_value
114
+ ) -> ReturnType:
115
+ """Brief one-line description of function.
116
+
117
+ Extended description explaining what the function does,
118
+ why it's needed, and any important behavior notes.
119
+
120
+ Args:
121
+ param1: Description of param1
122
+ param2: Description of param2
123
+ Can span multiple lines for complex parameters
124
+ optional_param: Description of optional parameter
125
+ Default: default_value
126
+
127
+ Returns:
128
+ Description of return value
129
+ For complex returns, describe structure:
130
+ {
131
+ "key1": type - description,
132
+ "key2": type - description
133
+ }
134
+
135
+ Raises:
136
+ ExceptionType: When/why this exception is raised
137
+ AnotherException: Description
138
+
139
+ Example:
140
+ >>> result = function_name(param1="value", param2=42)
141
+ >>> print(result["key"])
142
+ expected_output
143
+ """
144
+ # Implementation
145
+ pass
146
+ ```
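
The `Example` section above doubles as a doctest. As a minimal sketch (the `double` function is hypothetical, not part of this pattern), such docstring examples can be verified with the standard `doctest` module:

```python
import doctest

def double(x: int) -> int:
    """Double a value.

    Example:
        >>> double(21)
        42
    """
    return 2 * x

# Run all docstring examples in this module; failed == 0 when they all pass
results = doctest.testmod()
print(results.failed)
```

Running the script directly (or via `python -m doctest script.py`) keeps documented examples in sync with the code.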
147
+
148
+ ### Example: Well-Documented Function
149
+
150
+ ```python
151
+ def analyze_clinical_trials(
152
+ trials_data: List[Dict[str, Any]],
153
+ filter_criteria: Dict[str, str],
154
+ output_format: str = "json"
155
+ ) -> Dict[str, Any]:
156
+ """Analyze clinical trial data with specified filters.
157
+
158
+ Performs comprehensive analysis of clinical trial records including:
159
+ - Filtering by phase, status, and condition
160
+ - Statistical aggregation by sponsor and location
161
+ - Response rate comparison across trial phases
162
+
163
+ Args:
164
+ trials_data: List of trial dictionaries from dbQuery or API
165
+ Each dict must contain: nct_id, phase, status, condition,
166
+ sponsor, response_rate, patient_count
167
+ filter_criteria: Dictionary of filter conditions
168
+ Example: {"phase": "Phase 3", "status": "Recruiting"}
169
+ Supported keys: phase, status, condition, sponsor
170
+ output_format: Output format for results
171
+ Options: "json", "dataframe", "dict"
172
+ Default: "json"
173
+
174
+ Returns:
175
+ Dictionary containing analysis results:
176
+ {
177
+ "filtered_count": int - Number of trials after filtering,
178
+ "statistics": {
179
+ "avg_response_rate": float,
180
+ "total_patients": int,
181
+ "by_phase": Dict[str, Dict]
182
+ },
183
+ "trials": List[Dict] - Filtered trial records
184
+ }
185
+
186
+ Raises:
187
+ ValueError: If filter_criteria contains unsupported keys
188
+ KeyError: If trials_data missing required fields
189
+ TypeError: If trials_data is not a list
190
+
191
+ Example:
192
+ >>> trials = dbQuery("SELECT * FROM clinical_trials")
193
+ >>> results = analyze_clinical_trials(
194
+ ... trials_data=trials,
195
+ ... filter_criteria={"phase": "Phase 3"},
196
+ ... output_format="json"
197
+ ... )
198
+ >>> print(results["filtered_count"])
199
+ 42
200
+ >>> print(results["statistics"]["avg_response_rate"])
201
+ 0.65
202
+ """
203
+ # Validate inputs
204
+ _validate_filter_criteria(filter_criteria)
205
+ validated_data = _validate_trials(trials_data)
206
+
207
+ # Apply filters (reusable function)
208
+ filtered = _apply_filters(validated_data, filter_criteria)
209
+
210
+ # Calculate statistics (reusable function)
211
+ stats = _calculate_statistics(filtered)
212
+
213
+ # Format output
214
+ if output_format == "json":
215
+ return {
216
+ "filtered_count": len(filtered),
217
+ "statistics": stats,
218
+ "trials": filtered
219
+ }
220
+ elif output_format == "dataframe":
221
+ import pandas as pd
222
+ return {
223
+ "filtered_count": len(filtered),
224
+ "statistics": stats,
225
+ "trials": pd.DataFrame(filtered)
226
+ }
227
+ else:
228
+ return {
229
+ "filtered_count": len(filtered),
230
+ "statistics": stats,
231
+ "trials": filtered
232
+ }
233
+ ```
234
+
235
+ ---
236
+
237
+ ## 3. DRY Principle Enforcement
238
+
239
+ ### Core Principles
240
+
241
+ 1. **Single Responsibility** - Each function does ONE thing
242
+ 2. **No Code Duplication** - If code appears twice, extract it into a function
243
+ 3. **Reusable Components** - Design functions for reuse across scripts
244
+ 4. **Configuration Over Hardcoding** - Use parameters, not hardcoded values
245
+
246
+ ### Violation Examples and Corrections
247
+
248
+ #### Violation 1: Code Duplication
249
+
250
+ ```python
251
+ # ❌ BAD: Duplicated validation logic
252
+ def analyze_trials(trials):
253
+ """Analyze trials with inline validation."""
254
+ for trial in trials:
255
+ if not trial.get("nct_id"):
256
+ raise ValueError("Missing NCT ID")
257
+ if not trial.get("phase"):
258
+ raise ValueError("Missing phase")
259
+ if not trial.get("status"):
260
+ raise ValueError("Missing status")
261
+ # ... analysis logic
262
+
263
+ def export_trials(trials):
264
+ """Export trials with inline validation."""
265
+ for trial in trials:
266
+ if not trial.get("nct_id"):
267
+ raise ValueError("Missing NCT ID")
268
+ if not trial.get("phase"):
269
+ raise ValueError("Missing phase")
270
+ if not trial.get("status"):
271
+ raise ValueError("Missing status")
272
+ # ... export logic
273
+ ```
274
+
275
+ ```python
276
+ # ✅ GOOD: DRY principle - reusable validation function
277
+ def _validate_trial(trial: Dict[str, Any]) -> bool:
278
+ """Validate single trial record.
279
+
280
+ Checks for required fields and data types.
281
+ Reusable across multiple functions.
282
+
283
+ Args:
284
+ trial: Trial dictionary to validate
285
+
286
+ Returns:
287
+ True if valid
288
+
289
+ Raises:
290
+ ValueError: If required fields missing or invalid
291
+ """
292
+ required_fields = {
293
+ "nct_id": str,
294
+ "phase": str,
295
+ "status": str,
296
+ "response_rate": (int, float),
297
+ "patient_count": int
298
+ }
299
+
300
+ for field, expected_type in required_fields.items():
301
+         if field not in trial or trial[field] is None:  # don't reject valid falsy values like 0.0
302
+ raise ValueError(f"Missing required field: {field}")
303
+ if not isinstance(trial[field], expected_type):
304
+ raise ValueError(
305
+ f"Field {field} has wrong type: "
306
+ f"expected {expected_type}, got {type(trial[field])}"
307
+ )
308
+
309
+ return True
310
+
311
+ def analyze_trials(trials: List[Dict]) -> Dict:
312
+ """Analyze trials with reusable validation."""
313
+     for t in trials:
+         _validate_trial(t)  # raises ValueError on the first invalid record
314
+ # ... analysis logic
315
+
316
+ def export_trials(trials: List[Dict], output_path: str) -> None:
317
+ """Export trials with reusable validation."""
318
+     for t in trials:
+         _validate_trial(t)  # raises ValueError on the first invalid record
319
+ # ... export logic
320
+ ```
321
+
322
+ #### Violation 2: Hardcoded Values
323
+
324
+ ```python
325
+ # ❌ BAD: Hardcoded configuration
326
+ def analyze_data(data):
327
+ """Analyze with hardcoded thresholds."""
328
+ if len(data) > 1000:
329
+ print("Large dataset detected")
330
+
331
+ for item in data:
332
+ if item["value"] > 0.5: # Magic number
333
+ item["category"] = "High"
334
+ else:
335
+ item["category"] = "Low"
336
+
337
+ return data
338
+ ```
339
+
340
+ ```python
341
+ # ✅ GOOD: Configuration via parameters
342
+ def analyze_data(
343
+ data: List[Dict],
344
+ large_dataset_threshold: int = 1000,
345
+ category_threshold: float = 0.5
346
+ ) -> List[Dict]:
347
+ """Analyze with configurable thresholds.
348
+
349
+ Args:
350
+ data: List of data items
351
+ large_dataset_threshold: Threshold for large dataset warning
352
+ Default: 1000
353
+ category_threshold: Threshold for High/Low categorization
354
+ Default: 0.5
355
+
356
+ Returns:
357
+ Data with categories assigned
358
+ """
359
+ if len(data) > large_dataset_threshold:
360
+ print(f"Large dataset detected: {len(data)} items")
361
+
362
+ for item in data:
363
+ if item["value"] > category_threshold:
364
+ item["category"] = "High"
365
+ else:
366
+ item["category"] = "Low"
367
+
368
+ return data
369
+ ```
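
A quick usage sketch of the configurable version (condensed re-statement so the snippet runs standalone):

```python
from typing import Dict, List

def analyze_data(
    data: List[Dict],
    category_threshold: float = 0.5
) -> List[Dict]:
    """Condensed version of the configurable analyzer above."""
    for item in data:
        item["category"] = "High" if item["value"] > category_threshold else "Low"
    return data

items = [{"value": 0.7}, {"value": 0.2}]
print([i["category"] for i in analyze_data(items)])       # ['High', 'Low']
print([i["category"] for i in analyze_data(items, 0.9)])  # ['Low', 'Low']
```

The default preserves the original behavior, while callers can tune the threshold without touching the function body.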
370
+
371
+ #### Violation 3: Multiple Responsibilities
372
+
373
+ ```python
374
+ # ❌ BAD: Function does too many things
375
+ def process_trials(file_path):
376
+ """Load, validate, filter, analyze, and export trials."""
377
+ # Load data
378
+ df = pd.read_excel(file_path)
379
+
380
+ # Validate
381
+ if df.empty:
382
+ raise ValueError("Empty file")
383
+
384
+ # Filter
385
+ df = df[df["phase"] == "Phase 3"]
386
+
387
+ # Analyze
388
+ stats = df.groupby("condition")["response_rate"].mean()
389
+
390
+ # Export
391
+ stats.to_excel("output.xlsx")
392
+
393
+ # Visualize
394
+ stats.plot(kind="bar")
395
+ plt.savefig("plot.png")
396
+
397
+ return stats
398
+ ```
399
+
400
+ ```python
401
+ # ✅ GOOD: Single responsibility per function
402
+ def load_trials(file_path: str) -> pd.DataFrame:
403
+ """Load trials from Excel file."""
404
+ return pd.read_excel(file_path)
405
+
406
+ def validate_trials(df: pd.DataFrame) -> pd.DataFrame:
407
+ """Validate trial data."""
408
+ if df.empty:
409
+ raise ValueError("Empty file")
410
+ required_cols = ["nct_id", "phase", "response_rate"]
411
+ missing = [col for col in required_cols if col not in df.columns]
412
+ if missing:
413
+ raise ValueError(f"Missing columns: {missing}")
414
+ return df
415
+
416
+ def filter_trials(
417
+ df: pd.DataFrame,
418
+     phase: Optional[str] = None,
+     status: Optional[str] = None
420
+ ) -> pd.DataFrame:
421
+ """Filter trials by criteria."""
422
+ if phase:
423
+ df = df[df["phase"] == phase]
424
+ if status:
425
+ df = df[df["status"] == status]
426
+ return df
427
+
428
+ def analyze_trials(df: pd.DataFrame) -> pd.Series:
429
+ """Calculate statistics by condition."""
430
+ return df.groupby("condition")["response_rate"].mean()
431
+
432
+ def export_results(stats: pd.Series, output_path: str) -> None:
433
+ """Export statistics to Excel."""
434
+ stats.to_excel(output_path)
435
+
436
+ def create_visualization(stats: pd.Series, output_path: str) -> None:
437
+ """Create bar chart visualization."""
438
+ import matplotlib.pyplot as plt
439
+ stats.plot(kind="bar")
440
+ plt.savefig(output_path)
441
+
442
+ # Main workflow
443
+ def process_trials(file_path: str, output_dir: str) -> pd.Series:
444
+ """Process trials using modular functions."""
445
+ df = load_trials(file_path)
446
+ df = validate_trials(df)
447
+ df = filter_trials(df, phase="Phase 3")
448
+ stats = analyze_trials(df)
449
+ export_results(stats, f"{output_dir}/stats.xlsx")
450
+ create_visualization(stats, f"{output_dir}/plot.png")
451
+ return stats
452
+ ```
453
+
454
+ ---
455
+
456
+ ## 4. Type Hints
457
+
458
+ ### Why Type Hints Matter
459
+
460
+ - Improve code readability
461
+ - Enable IDE autocomplete and type checking
462
+ - Document expected types
463
+ - Catch errors early
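
As a small sketch of these benefits (the `mean` helper is illustrative), annotations are plain runtime metadata that IDEs and static checkers such as mypy read, so a call with the wrong type is flagged before the script ever runs:

```python
from typing import List

def mean(values: List[float]) -> float:
    """Average of a non-empty list of floats."""
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # 2.0
# A static checker such as mypy would flag: mean("not a list")
print(mean.__annotations__["return"])
```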
464
+
465
+ ### Standard Type Hint Patterns
466
+
467
+ ```python
468
+ from typing import Dict, List, Any, Optional, Union, Tuple
469
+
470
+ # Basic types
471
+ def process_string(text: str) -> str:
472
+ pass
473
+
474
+ # Collections
475
+ def process_list(items: List[str]) -> List[int]:
476
+ pass
477
+
478
+ def process_dict(data: Dict[str, Any]) -> Dict[str, float]:
479
+ pass
480
+
481
+ # Optional parameters
482
+ def search(
483
+ query: str,
484
+ limit: Optional[int] = None
485
+ ) -> List[Dict]:
486
+ pass
487
+
488
+ # Union types
489
+ def parse_value(value: Union[str, int, float]) -> float:
490
+ pass
491
+
492
+ # Tuple returns
493
+ def get_stats(data: List[float]) -> Tuple[float, float, float]:
494
+ """Returns (mean, std, median)."""
495
+ pass
496
+
497
+ # Complex structures
498
+ def analyze_trials(
499
+ trials: List[Dict[str, Union[str, int, float]]]
500
+ ) -> Dict[str, Any]:
501
+ pass
502
+ ```
503
+
504
+ ---
505
+
506
+ ## 5. Error Handling
507
+
508
+ ### Validation Pattern
509
+
510
+ ```python
511
+ def analyze_data(data: List[Dict], threshold: float) -> Dict:
512
+ """Analyze data with comprehensive validation.
513
+
514
+ Args:
515
+ data: List of data items
516
+ threshold: Analysis threshold (0.0 to 1.0)
517
+
518
+ Returns:
519
+ Analysis results
520
+
521
+ Raises:
522
+ TypeError: If data is not a list
523
+ ValueError: If threshold out of range or data empty
524
+ """
525
+ # Type validation
526
+ if not isinstance(data, list):
527
+ raise TypeError(f"data must be list, got {type(data)}")
528
+
529
+ if not isinstance(threshold, (int, float)):
530
+ raise TypeError(f"threshold must be numeric, got {type(threshold)}")
531
+
532
+ # Value validation
533
+ if not 0.0 <= threshold <= 1.0:
534
+ raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")
535
+
536
+ if len(data) == 0:
537
+ raise ValueError("data cannot be empty")
538
+
539
+ # Field validation
540
+ for i, item in enumerate(data):
541
+ if "value" not in item:
542
+ raise ValueError(f"Item {i} missing 'value' field")
543
+
544
+ # Analysis logic
545
+ results = [item for item in data if item["value"] > threshold]
546
+
547
+ return {
548
+ "count": len(results),
549
+ "threshold": threshold,
550
+ "items": results
551
+ }
552
+ ```
553
+
554
+ ### Exception Handling Pattern
555
+
556
+ ```python
557
+ def load_and_process(file_path: str) -> Dict:
558
+ """Load and process file with error handling.
559
+
560
+ Args:
561
+ file_path: Path to input file
562
+
563
+ Returns:
564
+ Processed data
565
+
566
+ Raises:
567
+ FileNotFoundError: If file doesn't exist
568
+ ValueError: If file format invalid
569
+ """
570
+ import json
571
+
572
+ # Try-except for external operations
573
+ try:
574
+ with open(file_path, 'r') as f:
575
+ data = json.load(f)
576
+ except FileNotFoundError:
577
+ raise FileNotFoundError(f"File not found: {file_path}")
578
+ except json.JSONDecodeError as e:
579
+ raise ValueError(f"Invalid JSON in {file_path}: {e}")
580
+
581
+ # Validate structure
582
+ if not isinstance(data, dict):
583
+ raise ValueError(f"Expected dict, got {type(data)}")
584
+
585
+ # Process with error handling
586
+ try:
587
+ processed = _process_data(data)
588
+ except Exception as e:
589
+ raise RuntimeError(f"Processing failed: {e}")
590
+
591
+ return processed
592
+ ```
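
A condensed, runnable sketch of the same error-mapping idea (the `load_json` helper and file handling are illustrative):

```python
import json
import os
import tempfile

def load_json(path: str) -> dict:
    """Load a JSON file, mapping low-level errors to informative ones."""
    try:
        with open(path, "r") as f:
            data = json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {path}: {e}")
    # Validate structure before returning
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")
    return data

# Demonstrate with a temporary file
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"status": "ok"}')
    path = f.name
print(load_json(path))  # {'status': 'ok'}
os.unlink(path)
```

Callers see a `ValueError` with the offending path instead of a bare `JSONDecodeError`, which makes failures in multi-file pipelines much easier to trace.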
593
+
594
+ ---
595
+
596
+ ## 6. File Organization
597
+
598
+ ### Directory Structure
599
+
600
+ ```
601
+ .scripts/py/
602
+ ├── [topic]_analysis.py # Main analysis script
603
+ ├── [topic]_utils.py # Reusable utilities
604
+ ├── [topic]_config.py # Configuration constants
605
+ └── requirements.txt # Dependencies (if needed)
606
+ ```
607
+
608
+ ### Script Structure
609
+
610
+ ```python
611
+ #!/usr/bin/env python3
612
+ """Module docstring"""
613
+
614
+ # 1. Imports
615
+ import sys
616
+ from typing import Dict, List, Any
617
+ import pandas as pd
618
+
619
+ # 2. Constants/Configuration
620
+ DEFAULT_THRESHOLD = 0.5
621
+ SUPPORTED_FORMATS = ["json", "csv", "excel"]
622
+
623
+ # 3. Utility functions (private, prefix with _)
624
+ def _validate_input(data: Any) -> bool:
625
+ """Private utility function."""
626
+ pass
627
+
628
+ def _format_output(results: Dict) -> str:
629
+ """Private utility function."""
630
+ pass
631
+
632
+ # 4. Main functions (public)
633
+ def analyze_data(data: List[Dict]) -> Dict:
634
+ """Public main function."""
635
+ pass
636
+
637
+ def export_results(results: Dict, output_path: str) -> None:
638
+ """Public utility function."""
639
+ pass
640
+
641
+ # 5. CLI interface (if applicable)
642
+ def main():
643
+ """Command-line interface."""
644
+ import argparse
645
+
646
+ parser = argparse.ArgumentParser(description="...")
647
+ parser.add_argument("command", choices=["analyze", "export"])
648
+ parser.add_argument("--input", required=True)
649
+ parser.add_argument("--output", required=True)
650
+
651
+ args = parser.parse_args()
652
+
653
+ # CLI logic
654
+ if args.command == "analyze":
655
+ data = load_data(args.input)
656
+ results = analyze_data(data)
657
+ export_results(results, args.output)
658
+
659
+ # 6. Main execution block
660
+ if __name__ == "__main__":
661
+ main()
662
+ ```
663
+
664
+ ---
665
+
666
+ ## 7. Code Quality Checklist
667
+
668
+ Before finalizing any Python script, verify:
669
+
670
+ ### Documentation
671
+ - [ ] Module docstring complete (purpose, usage, dependencies)
672
+ - [ ] All functions have docstrings
673
+ - [ ] Args section complete with types
674
+ - [ ] Returns section complete with types
675
+ - [ ] Raises section for functions that can fail
676
+ - [ ] Examples included for complex functions
677
+
678
+ ### DRY Principle
679
+ - [ ] No duplicated code blocks
680
+ - [ ] Repeated logic extracted to functions
681
+ - [ ] Single responsibility per function
682
+ - [ ] Configuration via parameters, not hardcoded values
683
+ - [ ] Utility functions reusable across scripts
684
+
685
+ ### Type Hints
686
+ - [ ] All function parameters have type hints
687
+ - [ ] All return types specified
688
+ - [ ] Complex types imported from typing module
689
+ - [ ] Optional parameters marked as Optional[Type]
690
+
691
+ ### Error Handling
692
+ - [ ] Input validation at function start
693
+ - [ ] Type checking for critical parameters
694
+ - [ ] Try-except for external operations
695
+ - [ ] Informative error messages
696
+ - [ ] Proper exception types
697
+
698
+ ### Code Style
699
+ - [ ] Consistent naming (snake_case for functions/variables)
700
+ - [ ] Functions are < 50 lines (if longer, split)
701
+ - [ ] Logical grouping of related functions
702
+ - [ ] Private functions prefixed with _
703
+ - [ ] No unused imports or variables
704
+
705
+ ---
706
+
707
+ ## 8. Example: Complete Well-Structured Script
708
+
709
+ ```python
710
+ #!/usr/bin/env python3
711
+ """Clinical Trial Statistical Analysis
712
+
713
+ This module provides functionality for:
714
+ - Loading trial data from multiple sources (Excel, CSV, database)
715
+ - Calculating response rates and survival statistics
716
+ - Performing statistical comparisons (chi-square, t-test, log-rank)
717
+ - Generating comprehensive reports with visualizations
718
+
719
+ Usage:
720
+ uv run python trial_stats.py analyze --input trials.xlsx --output results/
721
+ uv run python trial_stats.py compare --group1 Phase2 --group2 Phase3
722
+ uv run python trial_stats.py report --results results/ --output report.md
723
+
724
+ Dependencies:
725
+ - pandas >= 1.5.0
726
+ - scipy >= 1.9.0
727
+ - openpyxl >= 3.0.0
728
+ - matplotlib >= 3.6.0
729
+ - lifelines >= 0.27.0
730
+
731
+ Configuration:
732
+ env.jsonc for database connection (optional)
733
+
734
+ Output:
735
+ - JSON files with statistical results
736
+ - Excel files with aggregated data
737
+ - PNG files with visualizations
738
+ - Markdown report with findings
739
+
740
+ Author: BioResearcher AI Agent
741
+ Date: 2024-01-15
742
+ Version: 1.0.0
743
+ """
744
+
745
746
+ from typing import Dict, List, Any, Optional, Tuple
747
+ import pandas as pd
748
+ import numpy as np
749
+ from scipy import stats
750
+
751
+ # Constants
752
+ DEFAULT_ALPHA = 0.05
753
+ MIN_SAMPLE_SIZE = 10
754
+ SUPPORTED_FORMATS = ["json", "csv", "excel"]
755
+
756
+ # Private utility functions
757
+
758
+ def _validate_dataframe(df: pd.DataFrame, required_cols: List[str]) -> None:
759
+ """Validate DataFrame has required columns.
760
+
761
+ Args:
762
+ df: DataFrame to validate
763
+ required_cols: List of required column names
764
+
765
+ Raises:
766
+ ValueError: If required columns missing
767
+ """
768
+ missing = [col for col in required_cols if col not in df.columns]
769
+ if missing:
770
+ raise ValueError(f"Missing required columns: {missing}")
771
+
772
+ def _calculate_confidence_interval(
773
+ data: np.ndarray,
774
+ confidence: float = 0.95
775
+ ) -> Tuple[float, float]:
776
+ """Calculate confidence interval for data.
777
+
778
+ Args:
779
+ data: Array of values
780
+ confidence: Confidence level (0.0 to 1.0)
781
+
782
+ Returns:
783
+ Tuple of (lower_bound, upper_bound)
784
+ """
785
+ mean = np.mean(data)
786
+ sem = stats.sem(data)
787
+ interval = sem * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
788
+ return (mean - interval, mean + interval)
789
+
790
+ # Main analysis functions
791
+
792
+ def load_trial_data(
793
+ file_path: str,
794
+ file_format: str = "excel"
795
+ ) -> pd.DataFrame:
796
+ """Load trial data from file.
797
+
798
+ Args:
799
+ file_path: Path to input file
800
+ file_format: File format ("excel", "csv", "json")
801
+
802
+ Returns:
803
+ DataFrame with trial data
804
+
805
+ Raises:
806
+ FileNotFoundError: If file doesn't exist
807
+ ValueError: If format unsupported or file invalid
808
+ """
809
+ if file_format not in SUPPORTED_FORMATS:
810
+ raise ValueError(f"Unsupported format: {file_format}")
811
+
812
+ try:
813
+ if file_format == "excel":
814
+ df = pd.read_excel(file_path)
815
+ elif file_format == "csv":
816
+ df = pd.read_csv(file_path)
817
+ else: # json
818
+ df = pd.read_json(file_path)
819
+ except FileNotFoundError:
820
+ raise FileNotFoundError(f"File not found: {file_path}")
821
+ except Exception as e:
822
+ raise ValueError(f"Failed to load {file_path}: {e}")
823
+
824
+ # Validate structure
825
+ required = ["nct_id", "phase", "response_rate"]
826
+ _validate_dataframe(df, required)
827
+
828
+ return df
829
+
830
+ def compare_response_rates(
831
+ df: pd.DataFrame,
832
+ group_column: str = "phase",
833
+ alpha: float = DEFAULT_ALPHA
834
+ ) -> Dict[str, Any]:
835
+ """Compare response rates across groups with statistical testing.
836
+
837
+ Performs chi-square test for independence and calculates
838
+ confidence intervals for each group.
839
+
840
+ Args:
841
+ df: DataFrame containing trial data
842
+ group_column: Column to group by
843
+ alpha: Significance level (0.0 to 1.0)
844
+
845
+ Returns:
846
+ Dictionary containing:
847
+ {
848
+ "group_stats": Statistics per group,
849
+ "chi_square": Chi-square test results,
850
+ "significant": Boolean for significance,
851
+ "confidence_intervals": CI per group
852
+ }
853
+
854
+ Raises:
855
+ ValueError: If insufficient data or invalid alpha
856
+ """
857
+ if not 0 < alpha < 1:
858
+ raise ValueError(f"alpha must be 0-1, got {alpha}")
859
+
860
+ if len(df) < MIN_SAMPLE_SIZE:
861
+ raise ValueError(f"Insufficient data: {len(df)} < {MIN_SAMPLE_SIZE}")
862
+
863
+ # Group statistics
864
+ group_stats = df.groupby(group_column)["response_rate"].agg(
865
+ ["mean", "std", "count", "median"]
866
+ ).to_dict()
867
+
868
+ # Chi-square test
869
+ contingency = pd.crosstab(
870
+ df[group_column],
871
+ (df["response_rate"] > df["response_rate"].median()).astype(int)
872
+ )
873
+ chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
874
+
875
+ # Confidence intervals
876
+ ci_results = {}
877
+ for group in df[group_column].unique():
878
+ group_data = df[df[group_column] == group]["response_rate"].values
879
+ ci_low, ci_high = _calculate_confidence_interval(group_data)
880
+ ci_results[group] = {"lower": ci_low, "upper": ci_high}
881
+
882
+ return {
883
+ "group_stats": group_stats,
884
+ "chi_square": {
885
+ "statistic": chi2,
886
+ "p_value": p_value,
887
+ "degrees_of_freedom": dof
888
+ },
889
+ "significant": p_value < alpha,
890
+ "confidence_intervals": ci_results
891
+ }
892
+
893
+ # CLI interface
894
+
895
+ def main():
896
+ """Command-line interface for trial statistics."""
897
+ import argparse
898
+
899
+ parser = argparse.ArgumentParser(
900
+ description="Clinical trial statistical analysis"
901
+ )
902
+ parser.add_argument("command", choices=["analyze", "compare"])
903
+ parser.add_argument("--input", required=True, help="Input file path")
904
+ parser.add_argument("--output", required=True, help="Output directory")
905
+ parser.add_argument(
906
+ "--format",
907
+ choices=SUPPORTED_FORMATS,
908
+ default="excel",
909
+ help="Input file format"
910
+ )
911
+
912
+ args = parser.parse_args()
913
+
914
+ # Load data
915
+ df = load_trial_data(args.input, args.format)
916
+
917
+ # Execute command
918
+ if args.command == "compare":
919
+ results = compare_response_rates(df)
920
+ # Export results
921
+ import json
922
+ with open(f"{args.output}/comparison.json", 'w') as f:
923
+             json.dump(results, f, indent=2, default=float)  # default=float handles numpy scalars
924
+ print(f"Results saved to {args.output}/comparison.json")
925
+
926
+ if __name__ == "__main__":
927
+ main()
928
+ ```
929
+
930
+ ---
931
+
932
+ ## Summary
933
+
934
+ **Every Python script must have:**
935
+ 1. ✅ Complete module docstring
936
+ 2. ✅ Function docstrings with Args/Returns/Raises/Example
937
+ 3. ✅ No code duplication (DRY principle)
938
+ 4. ✅ Type hints for all functions
939
+ 5. ✅ Input validation and error handling
940
+ 6. ✅ Single responsibility per function
941
+ 7. ✅ Configuration via parameters
942
+ 8. ✅ Proper file organization
943
+
944
+ **Follow this pattern for all Python code to ensure maintainability, reusability, and professional quality.**