@yeyuan98/opencode-bioresearcher-plugin 1.5.2 → 1.5.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +1 -0
- package/dist/agents/bioresearcher/prompt.js +235 -235
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/analysis-methods.md +551 -551
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/best-practices.md +647 -647
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/python-standards.md +944 -944
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/report-template.md +613 -613
- package/dist/skills/bioresearcher-core/patterns/bioresearcher/tool-selection.md +481 -481
- package/dist/skills/bioresearcher-core/patterns/citations.md +234 -234
- package/dist/skills/bioresearcher-core/patterns/rate-limiting.md +167 -167
- package/dist/skills/gromacs-guides/SKILL.md +48 -0
- package/dist/skills/gromacs-guides/guides/create_index.md +96 -0
- package/dist/skills/gromacs-guides/guides/inspect_tpr.md +93 -0
- package/package.json +1 -1
@@ -1,944 +1,944 @@
# Python Code Standards (DRY Principle)

Comprehensive documentation requirements and DRY (Don't Repeat Yourself) principle enforcement for all Python scripts.

## Overview

This pattern defines mandatory standards for Python code written by the bioresearcher agent:
- Complete documentation (module and function docstrings)
- DRY principle enforcement (no code duplication)
- Type hints and validation
- Error handling best practices
- Reusable component design

---

## 1. Module-Level Documentation

### Required Elements

**EVERY Python script MUST include:**

```python
#!/usr/bin/env python3
"""[Script Purpose - One Line Description]

This module provides functionality for:
- [Functionality 1 - brief description]
- [Functionality 2 - brief description]
- [Functionality 3 - brief description]

Usage:
    uv run python script_name.py [command] [options]

Examples:
    uv run python script_name.py process --input data.xlsx --output results.json
    uv run python script_name.py validate --schema schema.json
    uv run python script_name.py analyze --config config.yaml

Dependencies:
    - pandas >= 1.5.0
    - openpyxl >= 3.0.0
    - [Other dependencies with version requirements]

Configuration:
    [Any configuration files or environment variables needed]

Output:
    [Description of output format and location]

Author: BioResearcher AI Agent
Date: YYYY-MM-DD
Version: 1.0.0
"""
```
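An aside worth knowing: the module docstring above is what `help()` and `pydoc` render, and it is available at runtime as `__doc__`, so a complete header doubles as discoverable documentation. A minimal sketch using a stdlib module:

```python
import json

# Any module's docstring is exposed as __doc__; the first line is the
# one-line purpose statement the template above asks for.
first_line = json.__doc__.splitlines()[0]
print(first_line)
```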

### Example: Well-Documented Module

```python
#!/usr/bin/env python3
"""Clinical Trial Response Rate Analysis

This module provides functionality for:
- Loading and validating clinical trial data from Excel/CSV files
- Calculating response rates across trial phases and conditions
- Performing statistical comparisons (chi-square, t-test)
- Generating summary reports with visualizations

Usage:
    uv run python trial_analysis.py analyze --input trials.xlsx --output results/
    uv run python trial_analysis.py compare --phase1 "Phase 2" --phase2 "Phase 3"
    uv run python trial_analysis.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0

Configuration:
    Requires env.jsonc with database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""
```

---

## 2. Function Documentation

### Standard Docstring Format

**Required sections:**
1. Brief description (one line)
2. Extended description (if needed)
3. Args (with types and descriptions)
4. Returns (with type and description)
5. Raises (if applicable)
6. Example (recommended)

### Template

```python
def function_name(
    param1: Type1,
    param2: Type2,
    optional_param: Type3 = default_value
) -> ReturnType:
    """Brief one-line description of function.

    Extended description explaining what the function does,
    why it's needed, and any important behavior notes.

    Args:
        param1: Description of param1
        param2: Description of param2
            Can span multiple lines for complex parameters
        optional_param: Description of optional parameter
            Default: default_value

    Returns:
        Description of return value
        For complex returns, describe structure:
        {
            "key1": type - description,
            "key2": type - description
        }

    Raises:
        ExceptionType: When/why this exception is raised
        AnotherException: Description

    Example:
        >>> result = function_name(param1="value", param2=42)
        >>> print(result["key"])
        expected_output
    """
    # Implementation
    pass
```

### Example: Well-Documented Function

```python
def analyze_clinical_trials(
    trials_data: List[Dict[str, Any]],
    filter_criteria: Dict[str, str],
    output_format: str = "json"
) -> Dict[str, Any]:
    """Analyze clinical trial data with specified filters.

    Performs comprehensive analysis of clinical trial records including:
    - Filtering by phase, status, and condition
    - Statistical aggregation by sponsor and location
    - Response rate comparison across trial phases

    Args:
        trials_data: List of trial dictionaries from dbQuery or API
            Each dict must contain: nct_id, phase, status, condition,
            sponsor, response_rate, patient_count
        filter_criteria: Dictionary of filter conditions
            Example: {"phase": "Phase 3", "status": "Recruiting"}
            Supported keys: phase, status, condition, sponsor
        output_format: Output format for results
            Options: "json", "dataframe", "dict"
            Default: "json"

    Returns:
        Dictionary containing analysis results:
        {
            "filtered_count": int - Number of trials after filtering,
            "statistics": {
                "avg_response_rate": float,
                "total_patients": int,
                "by_phase": Dict[str, Dict]
            },
            "trials": List[Dict] - Filtered trial records
        }

    Raises:
        ValueError: If filter_criteria contains unsupported keys
        KeyError: If trials_data missing required fields
        TypeError: If trials_data is not a list

    Example:
        >>> trials = dbQuery("SELECT * FROM clinical_trials")
        >>> results = analyze_clinical_trials(
        ...     trials_data=trials,
        ...     filter_criteria={"phase": "Phase 3"},
        ...     output_format="json"
        ... )
        >>> print(results["filtered_count"])
        42
        >>> print(results["statistics"]["avg_response_rate"])
        0.65
    """
    # Validate inputs
    _validate_filter_criteria(filter_criteria)
    validated_data = _validate_trials(trials_data)

    # Apply filters (reusable function)
    filtered = _apply_filters(validated_data, filter_criteria)

    # Calculate statistics (reusable function)
    stats = _calculate_statistics(filtered)

    # Format output ("json" and "dict" share one return shape; only
    # "dataframe" differs, so there is no duplicated branch)
    if output_format == "dataframe":
        import pandas as pd
        return {
            "filtered_count": len(filtered),
            "statistics": stats,
            "trials": pd.DataFrame(filtered)
        }
    return {
        "filtered_count": len(filtered),
        "statistics": stats,
        "trials": filtered
    }
```

---

## 3. DRY Principle Enforcement

### Core Principles

1. **Single Responsibility** - Each function does ONE thing
2. **No Code Duplication** - If code appears twice, extract to function
3. **Reusable Components** - Design functions for reuse across scripts
4. **Configuration Over Hardcoding** - Use parameters, not hardcoded values

### Violation Examples and Corrections

#### Violation 1: Code Duplication

```python
# ❌ BAD: Duplicated validation logic
def analyze_trials(trials):
    """Analyze trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... analysis logic

def export_trials(trials):
    """Export trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... export logic
```

```python
# ✅ GOOD: DRY principle - reusable validation function
def _validate_trial(trial: Dict[str, Any]) -> bool:
    """Validate single trial record.

    Checks for required fields and data types.
    Reusable across multiple functions.

    Args:
        trial: Trial dictionary to validate

    Returns:
        True if valid

    Raises:
        ValueError: If required fields missing or invalid
    """
    required_fields = {
        "nct_id": str,
        "phase": str,
        "status": str,
        "response_rate": (int, float),
        "patient_count": int
    }

    for field, expected_type in required_fields.items():
        # 'is None' rather than a falsy check, so a 0 rate/count still passes
        if trial.get(field) is None:
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(trial[field], expected_type):
            raise ValueError(
                f"Field {field} has wrong type: "
                f"expected {expected_type}, got {type(trial[field])}"
            )

    return True

def analyze_trials(trials: List[Dict]) -> Dict:
    """Analyze trials with reusable validation."""
    validated = [_validate_trial(t) for t in trials]
    # ... analysis logic

def export_trials(trials: List[Dict], output_path: str) -> None:
    """Export trials with reusable validation."""
    validated = [_validate_trial(t) for t in trials]
    # ... export logic
```

#### Violation 2: Hardcoded Values

```python
# ❌ BAD: Hardcoded configuration
def analyze_data(data):
    """Analyze with hardcoded thresholds."""
    if len(data) > 1000:
        print("Large dataset detected")

    for item in data:
        if item["value"] > 0.5:  # Magic number
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

```python
# ✅ GOOD: Configuration via parameters
def analyze_data(
    data: List[Dict],
    large_dataset_threshold: int = 1000,
    category_threshold: float = 0.5
) -> List[Dict]:
    """Analyze with configurable thresholds.

    Args:
        data: List of data items
        large_dataset_threshold: Threshold for large dataset warning
            Default: 1000
        category_threshold: Threshold for High/Low categorization
            Default: 0.5

    Returns:
        Data with categories assigned
    """
    if len(data) > large_dataset_threshold:
        print(f"Large dataset detected: {len(data)} items")

    for item in data:
        if item["value"] > category_threshold:
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

#### Violation 3: Multiple Responsibilities

```python
# ❌ BAD: Function does too many things
def process_trials(file_path):
    """Load, validate, filter, analyze, and export trials."""
    # Load data
    df = pd.read_excel(file_path)

    # Validate
    if df.empty:
        raise ValueError("Empty file")

    # Filter
    df = df[df["phase"] == "Phase 3"]

    # Analyze
    stats = df.groupby("condition")["response_rate"].mean()

    # Export
    stats.to_excel("output.xlsx")

    # Visualize
    stats.plot(kind="bar")
    plt.savefig("plot.png")

    return stats
```

```python
# ✅ GOOD: Single responsibility per function
def load_trials(file_path: str) -> pd.DataFrame:
    """Load trials from Excel file."""
    return pd.read_excel(file_path)

def validate_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Validate trial data."""
    if df.empty:
        raise ValueError("Empty file")
    required_cols = ["nct_id", "phase", "response_rate"]
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def filter_trials(
    df: pd.DataFrame,
    phase: Optional[str] = None,
    status: Optional[str] = None
) -> pd.DataFrame:
    """Filter trials by criteria."""
    if phase:
        df = df[df["phase"] == phase]
    if status:
        df = df[df["status"] == status]
    return df

def analyze_trials(df: pd.DataFrame) -> pd.Series:
    """Calculate statistics by condition."""
    return df.groupby("condition")["response_rate"].mean()

def export_results(stats: pd.Series, output_path: str) -> None:
    """Export statistics to Excel."""
    stats.to_excel(output_path)

def create_visualization(stats: pd.Series, output_path: str) -> None:
    """Create bar chart visualization."""
    import matplotlib.pyplot as plt
    stats.plot(kind="bar")
    plt.savefig(output_path)

# Main workflow
def process_trials(file_path: str, output_dir: str) -> pd.Series:
    """Process trials using modular functions."""
    df = load_trials(file_path)
    df = validate_trials(df)
    df = filter_trials(df, phase="Phase 3")
    stats = analyze_trials(df)
    export_results(stats, f"{output_dir}/stats.xlsx")
    create_visualization(stats, f"{output_dir}/plot.png")
    return stats
```
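Because each step has exactly one responsibility, every stage can be exercised in isolation. A pandas-free sketch of the same pipeline shape (hypothetical field names) makes the point:

```python
def keep_phase(rows, phase):
    """Filter step: keep rows matching a phase."""
    return [r for r in rows if r["phase"] == phase]

def mean_rate(rows):
    """Analysis step: average response rate."""
    return sum(r["response_rate"] for r in rows) / len(rows)

rows = [
    {"phase": "Phase 3", "response_rate": 0.6},
    {"phase": "Phase 3", "response_rate": 0.4},
    {"phase": "Phase 2", "response_rate": 0.1},
]
# Each step is trivially testable on its own, and the pipeline is just composition:
print(mean_rate(keep_phase(rows, "Phase 3")))  # 0.5
```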

---

## 4. Type Hints

### Why Type Hints Matter

- Improve code readability
- Enable IDE autocomplete and type checking
- Document expected types
- Catch errors early

### Standard Type Hint Patterns

```python
from typing import Dict, List, Any, Optional, Union, Tuple

# Basic types
def process_string(text: str) -> str:
    pass

# Collections
def process_list(items: List[str]) -> List[int]:
    pass

def process_dict(data: Dict[str, Any]) -> Dict[str, float]:
    pass

# Optional parameters
def search(
    query: str,
    limit: Optional[int] = None
) -> List[Dict]:
    pass

# Union types
def parse_value(value: Union[str, int, float]) -> float:
    pass

# Tuple returns
def get_stats(data: List[float]) -> Tuple[float, float, float]:
    """Returns (mean, std, median)."""
    pass

# Complex structures
def analyze_trials(
    trials: List[Dict[str, Union[str, int, float]]]
) -> Dict[str, Any]:
    pass
```
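One caveat worth noting: on Python 3.9+ the built-in collection types are themselves generic, so `list[float]` and `dict[str, Any]` work without any `typing` import (and on 3.10+, `X | None` can replace `Optional[X]`). A sketch of the same patterns in the newer syntax:

```python
# Python 3.9+: built-in collections are generic, no typing import needed.
def mean_rates(rates: list[float]) -> float:
    """Average a list of response rates."""
    return sum(rates) / len(rates)

def index_by_id(trials: list[dict[str, str]]) -> dict[str, dict[str, str]]:
    """Key each trial dict by its nct_id."""
    return {t["nct_id"]: t for t in trials}

print(mean_rates([0.5, 0.7]))
print(index_by_id([{"nct_id": "NCT01"}])["NCT01"])
```

Either style satisfies the checklist below; the point is consistency within a script.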

---

## 5. Error Handling

### Validation Pattern

```python
def analyze_data(data: List[Dict], threshold: float) -> Dict:
    """Analyze data with comprehensive validation.

    Args:
        data: List of data items
        threshold: Analysis threshold (0.0 to 1.0)

    Returns:
        Analysis results

    Raises:
        TypeError: If data is not a list
        ValueError: If threshold out of range or data empty
    """
    # Type validation
    if not isinstance(data, list):
        raise TypeError(f"data must be list, got {type(data)}")

    if not isinstance(threshold, (int, float)):
        raise TypeError(f"threshold must be numeric, got {type(threshold)}")

    # Value validation
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")

    if len(data) == 0:
        raise ValueError("data cannot be empty")

    # Field validation
    for i, item in enumerate(data):
        if "value" not in item:
            raise ValueError(f"Item {i} missing 'value' field")

    # Analysis logic
    results = [item for item in data if item["value"] > threshold]

    return {
        "count": len(results),
        "threshold": threshold,
        "items": results
    }
```
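The validation layer can be exercised directly. A condensed run of the pattern above (same checks, same error messages) shows both the happy path and a rejected call:

```python
from typing import Dict, List

def analyze_data(data: List[Dict], threshold: float) -> Dict:
    """Condensed version of the validation pattern above."""
    if not isinstance(data, list):
        raise TypeError(f"data must be list, got {type(data)}")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")
    if len(data) == 0:
        raise ValueError("data cannot be empty")
    results = [item for item in data if item["value"] > threshold]
    return {"count": len(results), "threshold": threshold, "items": results}

print(analyze_data([{"value": 0.9}, {"value": 0.2}], 0.5)["count"])  # 1
try:
    analyze_data([{"value": 0.9}], 1.5)
except ValueError as e:
    print(e)  # threshold must be 0.0-1.0, got 1.5
```

Validating up front means downstream code never has to guess what shape its inputs are in.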

### Exception Handling Pattern

```python
def load_and_process(file_path: str) -> Dict:
    """Load and process file with error handling.

    Args:
        file_path: Path to input file

    Returns:
        Processed data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file format invalid
    """
    import json

    # Try-except for external operations
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {file_path}: {e}")

    # Validate structure
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")

    # Process with error handling
    try:
        processed = _process_data(data)
    except Exception as e:
        raise RuntimeError(f"Processing failed: {e}")

    return processed
```
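One refinement to consider when re-raising: `raise ... from e` preserves the original exception as `__cause__`, so the traceback shows both the low-level failure and the domain-level message. A self-contained sketch:

```python
import json
import os
import tempfile

def load_json(path: str) -> dict:
    """Load JSON, translating parse errors while keeping the cause chain."""
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {path}: {e}") from e

# Write a deliberately malformed file, then observe the chained cause.
fd, path = tempfile.mkstemp(suffix=".json")
os.write(fd, b"not json")
os.close(fd)
try:
    load_json(path)
except ValueError as e:
    cause_name = type(e.__cause__).__name__
finally:
    os.remove(path)

print(cause_name)  # JSONDecodeError
```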

---

## 6. File Organization

### Directory Structure

```
.scripts/py/
├── [topic]_analysis.py    # Main analysis script
├── [topic]_utils.py       # Reusable utilities
├── [topic]_config.py      # Configuration constants
└── requirements.txt       # Dependencies (if needed)
```

### Script Structure

```python
#!/usr/bin/env python3
"""Module docstring"""

# 1. Imports
import sys
from typing import Dict, List, Any
import pandas as pd

# 2. Constants/Configuration
DEFAULT_THRESHOLD = 0.5
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# 3. Utility functions (private, prefix with _)
def _validate_input(data: Any) -> bool:
    """Private utility function."""
    pass

def _format_output(results: Dict) -> str:
    """Private utility function."""
    pass

# 4. Main functions (public)
def analyze_data(data: List[Dict]) -> Dict:
    """Public main function."""
    pass

def export_results(results: Dict, output_path: str) -> None:
    """Public utility function."""
    pass

# 5. CLI interface (if applicable)
def main():
    """Command-line interface."""
    import argparse

    parser = argparse.ArgumentParser(description="...")
    parser.add_argument("command", choices=["analyze", "export"])
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)

    args = parser.parse_args()

    # CLI logic
    if args.command == "analyze":
        data = load_data(args.input)
        results = analyze_data(data)
        export_results(results, args.output)

# 6. Main execution block
if __name__ == "__main__":
    main()
```
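As a CLI grows beyond one or two commands, `argparse` subparsers keep each command's options separate instead of funneling everything through shared flags. A sketch of the same interface (command and flag names mirror the template above):

```python
import argparse

parser = argparse.ArgumentParser(description="Analysis CLI")
sub = parser.add_subparsers(dest="command", required=True)

analyze = sub.add_parser("analyze", help="Run analysis")
analyze.add_argument("--input", required=True)
analyze.add_argument("--output", required=True)

export = sub.add_parser("export", help="Export results")
export.add_argument("--results", required=True)

# Parse a sample argv instead of sys.argv, for demonstration.
args = parser.parse_args(["analyze", "--input", "data.xlsx", "--output", "out/"])
print(args.command, args.input)  # analyze data.xlsx
```

Each subcommand now validates only its own flags, which keeps the `# CLI logic` dispatch trivial.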

---

## 7. Code Quality Checklist

Before finalizing any Python script, verify:

### Documentation
- [ ] Module docstring complete (purpose, usage, dependencies)
- [ ] All functions have docstrings
- [ ] Args section complete with types
- [ ] Returns section complete with types
- [ ] Raises section for functions that can fail
- [ ] Examples included for complex functions

### DRY Principle
- [ ] No duplicated code blocks
- [ ] Repeated logic extracted to functions
- [ ] Single responsibility per function
- [ ] Configuration via parameters, not hardcoded values
- [ ] Utility functions reusable across scripts

### Type Hints
- [ ] All function parameters have type hints
- [ ] All return types specified
- [ ] Complex types imported from typing module
- [ ] Optional parameters marked as Optional[Type]

### Error Handling
- [ ] Input validation at function start
- [ ] Type checking for critical parameters
- [ ] Try-except for external operations
- [ ] Informative error messages
- [ ] Proper exception types

### Code Style
- [ ] Consistent naming (snake_case for functions/variables)
- [ ] Functions are < 50 lines (if longer, split)
- [ ] Logical grouping of related functions
- [ ] Private functions prefixed with _
- [ ] No unused imports or variables

---

## 8. Example: Complete Well-Structured Script

```python
#!/usr/bin/env python3
"""Clinical Trial Statistical Analysis

This module provides functionality for:
- Loading trial data from multiple sources (Excel, CSV, database)
- Calculating response rates and survival statistics
- Performing statistical comparisons (chi-square, t-test, log-rank)
- Generating comprehensive reports with visualizations

Usage:
    uv run python trial_stats.py analyze --input trials.xlsx --output results/
    uv run python trial_stats.py compare --group1 Phase2 --group2 Phase3
    uv run python trial_stats.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0
    - lifelines >= 0.27.0

Configuration:
    env.jsonc for database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - PNG files with visualizations
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""

import sys
from typing import Dict, List, Any, Optional, Tuple
import pandas as pd
import numpy as np
from scipy import stats

# Constants
DEFAULT_ALPHA = 0.05
MIN_SAMPLE_SIZE = 10
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# Private utility functions

def _validate_dataframe(df: pd.DataFrame, required_cols: List[str]) -> None:
    """Validate DataFrame has required columns.

    Args:
        df: DataFrame to validate
        required_cols: List of required column names

    Raises:
        ValueError: If required columns missing
    """
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

def _calculate_confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95
) -> Tuple[float, float]:
    """Calculate confidence interval for data.

    Args:
        data: Array of values
        confidence: Confidence level (0.0 to 1.0)

    Returns:
        Tuple of (lower_bound, upper_bound)
    """
    mean = np.mean(data)
    sem = stats.sem(data)
    interval = sem * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
    return (mean - interval, mean + interval)

# Main analysis functions

def load_trial_data(
    file_path: str,
    file_format: str = "excel"
) -> pd.DataFrame:
    """Load trial data from file.

    Args:
        file_path: Path to input file
        file_format: File format ("excel", "csv", "json")

    Returns:
        DataFrame with trial data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If format unsupported or file invalid
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {file_format}")

    try:
        if file_format == "excel":
            df = pd.read_excel(file_path)
        elif file_format == "csv":
            df = pd.read_csv(file_path)
        else:  # json
            df = pd.read_json(file_path)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except Exception as e:
        raise ValueError(f"Failed to load {file_path}: {e}")

    # Validate structure
    required = ["nct_id", "phase", "response_rate"]
    _validate_dataframe(df, required)

    return df

def compare_response_rates(
    df: pd.DataFrame,
    group_column: str = "phase",
    alpha: float = DEFAULT_ALPHA
) -> Dict[str, Any]:
    """Compare response rates across groups with statistical testing.

    Performs chi-square test for independence and calculates
    confidence intervals for each group.

    Args:
        df: DataFrame containing trial data
        group_column: Column to group by
        alpha: Significance level (0.0 to 1.0)

    Returns:
        Dictionary containing:
        {
            "group_stats": Statistics per group,
            "chi_square": Chi-square test results,
            "significant": Boolean for significance,
            "confidence_intervals": CI per group
        }

    Raises:
        ValueError: If insufficient data or invalid alpha
    """
    if not 0 < alpha < 1:
        raise ValueError(f"alpha must be 0-1, got {alpha}")

    if len(df) < MIN_SAMPLE_SIZE:
        raise ValueError(f"Insufficient data: {len(df)} < {MIN_SAMPLE_SIZE}")

    # Group statistics
    group_stats = df.groupby(group_column)["response_rate"].agg(
        ["mean", "std", "count", "median"]
    ).to_dict()

    # Chi-square test
    contingency = pd.crosstab(
        df[group_column],
        (df["response_rate"] > df["response_rate"].median()).astype(int)
    )
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

    # Confidence intervals
    ci_results = {}
    for group in df[group_column].unique():
        group_data = df[df[group_column] == group]["response_rate"].values
        ci_low, ci_high = _calculate_confidence_interval(group_data)
        ci_results[group] = {"lower": ci_low, "upper": ci_high}

    return {
        "group_stats": group_stats,
        "chi_square": {
            "statistic": chi2,
            "p_value": p_value,
            "degrees_of_freedom": dof
        },
        "significant": p_value < alpha,
        "confidence_intervals": ci_results
    }

# CLI interface

def main():
    """Command-line interface for trial statistics."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Clinical trial statistical analysis"
    )
    parser.add_argument("command", choices=["analyze", "compare"])
    parser.add_argument("--input", required=True, help="Input file path")
    parser.add_argument("--output", required=True, help="Output directory")
    parser.add_argument(
        "--format",
        choices=SUPPORTED_FORMATS,
        default="excel",
        help="Input file format"
    )

    args = parser.parse_args()

    # Load data
    df = load_trial_data(args.input, args.format)

    # Execute command
    if args.command == "compare":
        results = compare_response_rates(df)
        # Export results
        import json
        with open(f"{args.output}/comparison.json", 'w') as f:
            json.dump(results, f, indent=2)
        print(f"Results saved to {args.output}/comparison.json")
```
|
-
if __name__ == "__main__":
|
|
927
|
-
main()
|
|
928
|
-
```
|
|
929
|
-
|
|
930
|
-
---
|
|
931
|
-
|
|
932
|
-
## Summary
|
|
933
|
-
|
|
934
|
-
**Every Python script must have:**
|
|
935
|
-
1. ✅ Complete module docstring
|
|
936
|
-
2. ✅ Function docstrings with Args/Returns/Raises/Example
|
|
937
|
-
3. ✅ No code duplication (DRY principle)
|
|
938
|
-
4. ✅ Type hints for all functions
|
|
939
|
-
5. ✅ Input validation and error handling
|
|
940
|
-
6. ✅ Single responsibility per function
|
|
941
|
-
7. ✅ Configuration via parameters
|
|
942
|
-
8. ✅ Proper file organization
|
|
943
|
-
|
|
944
|
-
**Follow this pattern for all Python code to ensure maintainability, reusability, and professional quality.**
|
|
# Python Code Standards (DRY Principle)

Comprehensive documentation requirements and DRY (Don't Repeat Yourself) principle enforcement for all Python scripts.

## Overview

This pattern defines mandatory standards for Python code written by the bioresearcher agent:
- Complete documentation (module and function docstrings)
- DRY principle enforcement (no code duplication)
- Type hints and validation
- Error handling best practices
- Reusable component design

---

## 1. Module-Level Documentation

### Required Elements

**EVERY Python script MUST include:**
```python
#!/usr/bin/env python3
"""[Script Purpose - One Line Description]

This module provides functionality for:
- [Functionality 1 - brief description]
- [Functionality 2 - brief description]
- [Functionality 3 - brief description]

Usage:
    uv run python script_name.py [command] [options]

Examples:
    uv run python script_name.py process --input data.xlsx --output results.json
    uv run python script_name.py validate --schema schema.json
    uv run python script_name.py analyze --config config.yaml

Dependencies:
    - pandas >= 1.5.0
    - openpyxl >= 3.0.0
    - [Other dependencies with version requirements]

Configuration:
    [Any configuration files or environment variables needed]

Output:
    [Description of output format and location]

Author: BioResearcher AI Agent
Date: YYYY-MM-DD
Version: 1.0.0
"""
```
### Example: Well-Documented Module

```python
#!/usr/bin/env python3
"""Clinical Trial Response Rate Analysis

This module provides functionality for:
- Loading and validating clinical trial data from Excel/CSV files
- Calculating response rates across trial phases and conditions
- Performing statistical comparisons (chi-square, t-test)
- Generating summary reports with visualizations

Usage:
    uv run python trial_analysis.py analyze --input trials.xlsx --output results/
    uv run python trial_analysis.py compare --phase1 "Phase 2" --phase2 "Phase 3"
    uv run python trial_analysis.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0

Configuration:
    Requires env.jsonc with database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""
```

---

## 2. Function Documentation

### Standard Docstring Format

**Required sections:**
1. Brief description (one line)
2. Extended description (if needed)
3. Args (with types and descriptions)
4. Returns (with type and description)
5. Raises (if applicable)
6. Example (recommended)
### Template

```python
def function_name(
    param1: Type1,
    param2: Type2,
    optional_param: Type3 = default_value
) -> ReturnType:
    """Brief one-line description of function.

    Extended description explaining what the function does,
    why it's needed, and any important behavior notes.

    Args:
        param1: Description of param1
        param2: Description of param2
            Can span multiple lines for complex parameters
        optional_param: Description of optional parameter
            Default: default_value

    Returns:
        Description of return value
        For complex returns, describe structure:
        {
            "key1": type - description,
            "key2": type - description
        }

    Raises:
        ExceptionType: When/why this exception is raised
        AnotherException: Description

    Example:
        >>> result = function_name(param1="value", param2=42)
        >>> print(result["key"])
        expected_output
    """
    # Implementation
    pass
```
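
Because the `Example:` section uses doctest-style `>>>` prompts, these examples can double as executable regression tests with the stdlib `doctest` module. A minimal sketch (the `add` function here is hypothetical, for illustration only):

```python
import doctest

def add(a: int, b: int) -> int:
    """Add two integers.

    Example:
        >>> add(2, 3)
        5
    """
    return a + b

# doctest collects and runs every >>> example found in this module's docstrings.
results = doctest.testmod()
print(results.failed)  # 0 failures means all docstring examples pass
```

Running the script directly (or via `python -m doctest`) then verifies that the documented examples still match the implementation.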

### Example: Well-Documented Function

```python
def analyze_clinical_trials(
    trials_data: List[Dict[str, Any]],
    filter_criteria: Dict[str, str],
    output_format: str = "json"
) -> Dict[str, Any]:
    """Analyze clinical trial data with specified filters.

    Performs comprehensive analysis of clinical trial records including:
    - Filtering by phase, status, and condition
    - Statistical aggregation by sponsor and location
    - Response rate comparison across trial phases

    Args:
        trials_data: List of trial dictionaries from dbQuery or API
            Each dict must contain: nct_id, phase, status, condition,
            sponsor, response_rate, patient_count
        filter_criteria: Dictionary of filter conditions
            Example: {"phase": "Phase 3", "status": "Recruiting"}
            Supported keys: phase, status, condition, sponsor
        output_format: Output format for results
            Options: "json", "dataframe", "dict"
            Default: "json"

    Returns:
        Dictionary containing analysis results:
        {
            "filtered_count": int - Number of trials after filtering,
            "statistics": {
                "avg_response_rate": float,
                "total_patients": int,
                "by_phase": Dict[str, Dict]
            },
            "trials": List[Dict] - Filtered trial records
        }

    Raises:
        ValueError: If filter_criteria contains unsupported keys
        KeyError: If trials_data missing required fields
        TypeError: If trials_data is not a list

    Example:
        >>> trials = dbQuery("SELECT * FROM clinical_trials")
        >>> results = analyze_clinical_trials(
        ...     trials_data=trials,
        ...     filter_criteria={"phase": "Phase 3"},
        ...     output_format="json"
        ... )
        >>> print(results["filtered_count"])
        42
        >>> print(results["statistics"]["avg_response_rate"])
        0.65
    """
    # Validate inputs
    _validate_filter_criteria(filter_criteria)
    validated_data = _validate_trials(trials_data)

    # Apply filters (reusable function)
    filtered = _apply_filters(validated_data, filter_criteria)

    # Calculate statistics (reusable function)
    stats = _calculate_statistics(filtered)

    # Format output
    if output_format == "json":
        return {
            "filtered_count": len(filtered),
            "statistics": stats,
            "trials": filtered
        }
    elif output_format == "dataframe":
        import pandas as pd
        return {
            "filtered_count": len(filtered),
            "statistics": stats,
            "trials": pd.DataFrame(filtered)
        }
    else:
        return {
            "filtered_count": len(filtered),
            "statistics": stats,
            "trials": filtered
        }
```

---

## 3. DRY Principle Enforcement

### Core Principles

1. **Single Responsibility** - Each function does ONE thing
2. **No Code Duplication** - If code appears twice, extract to function
3. **Reusable Components** - Design functions for reuse across scripts
4. **Configuration Over Hardcoding** - Use parameters, not hardcoded values

### Violation Examples and Corrections

#### Violation 1: Code Duplication

```python
# ❌ BAD: Duplicated validation logic
def analyze_trials(trials):
    """Analyze trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... analysis logic

def export_trials(trials):
    """Export trials with inline validation."""
    for trial in trials:
        if not trial.get("nct_id"):
            raise ValueError("Missing NCT ID")
        if not trial.get("phase"):
            raise ValueError("Missing phase")
        if not trial.get("status"):
            raise ValueError("Missing status")
    # ... export logic
```

```python
# ✅ GOOD: DRY principle - reusable validation function
def _validate_trial(trial: Dict[str, Any]) -> bool:
    """Validate single trial record.

    Checks for required fields and data types.
    Reusable across multiple functions.

    Args:
        trial: Trial dictionary to validate

    Returns:
        True if valid

    Raises:
        ValueError: If required fields missing or invalid
    """
    required_fields = {
        "nct_id": str,
        "phase": str,
        "status": str,
        "response_rate": (int, float),
        "patient_count": int
    }

    for field, expected_type in required_fields.items():
        if not trial.get(field):
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(trial[field], expected_type):
            raise ValueError(
                f"Field {field} has wrong type: "
                f"expected {expected_type}, got {type(trial[field])}"
            )

    return True

def analyze_trials(trials: List[Dict]) -> Dict:
    """Analyze trials with reusable validation."""
    validated = [_validate_trial(t) for t in trials]
    # ... analysis logic

def export_trials(trials: List[Dict], output_path: str) -> None:
    """Export trials with reusable validation."""
    validated = [_validate_trial(t) for t in trials]
    # ... export logic
```

#### Violation 2: Hardcoded Values

```python
# ❌ BAD: Hardcoded configuration
def analyze_data(data):
    """Analyze with hardcoded thresholds."""
    if len(data) > 1000:
        print("Large dataset detected")

    for item in data:
        if item["value"] > 0.5:  # Magic number
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

```python
# ✅ GOOD: Configuration via parameters
def analyze_data(
    data: List[Dict],
    large_dataset_threshold: int = 1000,
    category_threshold: float = 0.5
) -> List[Dict]:
    """Analyze with configurable thresholds.

    Args:
        data: List of data items
        large_dataset_threshold: Threshold for large dataset warning
            Default: 1000
        category_threshold: Threshold for High/Low categorization
            Default: 0.5

    Returns:
        Data with categories assigned
    """
    if len(data) > large_dataset_threshold:
        print(f"Large dataset detected: {len(data)} items")

    for item in data:
        if item["value"] > category_threshold:
            item["category"] = "High"
        else:
            item["category"] = "Low"

    return data
```

#### Violation 3: Multiple Responsibilities

```python
# ❌ BAD: Function does too many things
def process_trials(file_path):
    """Load, validate, filter, analyze, and export trials."""
    # Load data
    df = pd.read_excel(file_path)

    # Validate
    if df.empty:
        raise ValueError("Empty file")

    # Filter
    df = df[df["phase"] == "Phase 3"]

    # Analyze
    stats = df.groupby("condition")["response_rate"].mean()

    # Export
    stats.to_excel("output.xlsx")

    # Visualize
    stats.plot(kind="bar")
    plt.savefig("plot.png")

    return stats
```

```python
# ✅ GOOD: Single responsibility per function
def load_trials(file_path: str) -> pd.DataFrame:
    """Load trials from Excel file."""
    return pd.read_excel(file_path)

def validate_trials(df: pd.DataFrame) -> pd.DataFrame:
    """Validate trial data."""
    if df.empty:
        raise ValueError("Empty file")
    required_cols = ["nct_id", "phase", "response_rate"]
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def filter_trials(
    df: pd.DataFrame,
    phase: Optional[str] = None,
    status: Optional[str] = None
) -> pd.DataFrame:
    """Filter trials by criteria."""
    if phase:
        df = df[df["phase"] == phase]
    if status:
        df = df[df["status"] == status]
    return df

def analyze_trials(df: pd.DataFrame) -> pd.Series:
    """Calculate statistics by condition."""
    return df.groupby("condition")["response_rate"].mean()

def export_results(stats: pd.Series, output_path: str) -> None:
    """Export statistics to Excel."""
    stats.to_excel(output_path)

def create_visualization(stats: pd.Series, output_path: str) -> None:
    """Create bar chart visualization."""
    import matplotlib.pyplot as plt
    stats.plot(kind="bar")
    plt.savefig(output_path)

# Main workflow
def process_trials(file_path: str, output_dir: str) -> pd.Series:
    """Process trials using modular functions."""
    df = load_trials(file_path)
    df = validate_trials(df)
    df = filter_trials(df, phase="Phase 3")
    stats = analyze_trials(df)
    export_results(stats, f"{output_dir}/stats.xlsx")
    create_visualization(stats, f"{output_dir}/plot.png")
    return stats
```

---

## 4. Type Hints

### Why Type Hints Matter

- Improve code readability
- Enable IDE autocomplete and type checking
- Document expected types
- Catch errors early

### Standard Type Hint Patterns

```python
from typing import Dict, List, Any, Optional, Union, Tuple

# Basic types
def process_string(text: str) -> str:
    pass

# Collections
def process_list(items: List[str]) -> List[int]:
    pass

def process_dict(data: Dict[str, Any]) -> Dict[str, float]:
    pass

# Optional parameters
def search(
    query: str,
    limit: Optional[int] = None
) -> List[Dict]:
    pass

# Union types
def parse_value(value: Union[str, int, float]) -> float:
    pass

# Tuple returns
def get_stats(data: List[float]) -> Tuple[float, float, float]:
    """Returns (mean, std, median)."""
    pass

# Complex structures
def analyze_trials(
    trials: List[Dict[str, Union[str, int, float]]]
) -> Dict[str, Any]:
    pass
```
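
These annotations are also visible at runtime through the stdlib, which is how validators and tooling can consume them. A small sketch using `typing.get_type_hints` (the `search` function is re-declared here so the snippet runs standalone):

```python
from typing import Dict, List, Optional, get_type_hints

def search(query: str, limit: Optional[int] = None) -> List[Dict]:
    """Search with an optional result limit."""
    return []

# get_type_hints resolves the annotations into actual type objects.
hints = get_type_hints(search)
print(hints["query"])   # <class 'str'>
print(hints["return"])  # typing.List[typing.Dict]
```

This is the same mechanism static checkers and runtime validation libraries build on, so keeping hints accurate pays off beyond documentation.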

---

## 5. Error Handling

### Validation Pattern

```python
def analyze_data(data: List[Dict], threshold: float) -> Dict:
    """Analyze data with comprehensive validation.

    Args:
        data: List of data items
        threshold: Analysis threshold (0.0 to 1.0)

    Returns:
        Analysis results

    Raises:
        TypeError: If data is not a list
        ValueError: If threshold out of range or data empty
    """
    # Type validation
    if not isinstance(data, list):
        raise TypeError(f"data must be list, got {type(data)}")

    if not isinstance(threshold, (int, float)):
        raise TypeError(f"threshold must be numeric, got {type(threshold)}")

    # Value validation
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")

    if len(data) == 0:
        raise ValueError("data cannot be empty")

    # Field validation
    for i, item in enumerate(data):
        if "value" not in item:
            raise ValueError(f"Item {i} missing 'value' field")

    # Analysis logic
    results = [item for item in data if item["value"] > threshold]

    return {
        "count": len(results),
        "threshold": threshold,
        "items": results
    }
```
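
A quick demonstration of the fail-fast contract this pattern gives you, with a condensed re-declaration of `analyze_data` so the snippet is self-contained:

```python
from typing import Dict, List

def analyze_data(data: List[Dict], threshold: float) -> Dict:
    """Condensed version of the validation pattern above."""
    if not isinstance(data, list):
        raise TypeError(f"data must be list, got {type(data)}")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be 0.0-1.0, got {threshold}")
    if len(data) == 0:
        raise ValueError("data cannot be empty")
    results = [item for item in data if item["value"] > threshold]
    return {"count": len(results), "threshold": threshold, "items": results}

# Valid call: exactly one item exceeds the 0.5 threshold.
print(analyze_data([{"value": 0.9}, {"value": 0.2}], 0.5)["count"])  # 1

# Invalid calls fail immediately with an informative message.
try:
    analyze_data([], 0.5)
except ValueError as e:
    print(e)  # data cannot be empty
```

Failing at the top of the function means the caller sees the bad input directly, rather than a confusing downstream error.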

### Exception Handling Pattern

```python
def load_and_process(file_path: str) -> Dict:
    """Load and process file with error handling.

    Args:
        file_path: Path to input file

    Returns:
        Processed data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If file format invalid
    """
    import json

    # Try-except for external operations
    try:
        with open(file_path, 'r') as f:
            data = json.load(f)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {file_path}: {e}")

    # Validate structure
    if not isinstance(data, dict):
        raise ValueError(f"Expected dict, got {type(data)}")

    # Process with error handling
    try:
        processed = _process_data(data)
    except Exception as e:
        raise RuntimeError(f"Processing failed: {e}")

    return processed
```
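
One optional refinement to this pattern: re-raising with `raise ... from e` keeps the original exception attached as `__cause__`, so the full traceback survives the translation. A minimal sketch (the `load_config` helper is hypothetical, for illustration only):

```python
import json

def load_config(text: str) -> dict:
    """Parse a JSON config string, adding context but preserving the cause."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # "from e" chains the original error instead of discarding it.
        raise ValueError(f"Invalid JSON config: {e}") from e

try:
    load_config("{not valid json")
except ValueError as e:
    # The original JSONDecodeError is still reachable for debugging.
    print(type(e.__cause__).__name__)  # JSONDecodeError
```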

---

## 6. File Organization

### Directory Structure

```
.scripts/py/
├── [topic]_analysis.py   # Main analysis script
├── [topic]_utils.py      # Reusable utilities
├── [topic]_config.py     # Configuration constants
└── requirements.txt      # Dependencies (if needed)
```

### Script Structure

```python
#!/usr/bin/env python3
"""Module docstring"""

# 1. Imports
import sys
from typing import Dict, List, Any
import pandas as pd

# 2. Constants/Configuration
DEFAULT_THRESHOLD = 0.5
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# 3. Utility functions (private, prefix with _)
def _validate_input(data: Any) -> bool:
    """Private utility function."""
    pass

def _format_output(results: Dict) -> str:
    """Private utility function."""
    pass

# 4. Main functions (public)
def analyze_data(data: List[Dict]) -> Dict:
    """Public main function."""
    pass

def export_results(results: Dict, output_path: str) -> None:
    """Public utility function."""
    pass

# 5. CLI interface (if applicable)
def main():
    """Command-line interface."""
    import argparse

    parser = argparse.ArgumentParser(description="...")
    parser.add_argument("command", choices=["analyze", "export"])
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)

    args = parser.parse_args()

    # CLI logic
    if args.command == "analyze":
        data = load_data(args.input)
        results = analyze_data(data)
        export_results(results, args.output)

# 6. Main execution block
if __name__ == "__main__":
    main()
```

---

## 7. Code Quality Checklist

Before finalizing any Python script, verify:

### Documentation
- [ ] Module docstring complete (purpose, usage, dependencies)
- [ ] All functions have docstrings
- [ ] Args section complete with types
- [ ] Returns section complete with types
- [ ] Raises section for functions that can fail
- [ ] Examples included for complex functions

### DRY Principle
- [ ] No duplicated code blocks
- [ ] Repeated logic extracted to functions
- [ ] Single responsibility per function
- [ ] Configuration via parameters, not hardcoded values
- [ ] Utility functions reusable across scripts

### Type Hints
- [ ] All function parameters have type hints
- [ ] All return types specified
- [ ] Complex types imported from typing module
- [ ] Optional parameters marked as Optional[Type]

### Error Handling
- [ ] Input validation at function start
- [ ] Type checking for critical parameters
- [ ] Try-except for external operations
- [ ] Informative error messages
- [ ] Proper exception types

### Code Style
- [ ] Consistent naming (snake_case for functions/variables)
- [ ] Functions are < 50 lines (if longer, split)
- [ ] Logical grouping of related functions
- [ ] Private functions prefixed with _
- [ ] No unused imports or variables

---
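
Parts of this checklist can be automated. A sketch of a docstring-coverage check using only the stdlib (`check_docstrings` is a hypothetical helper written for this example, not part of any established tool):

```python
import inspect
import types
from typing import List

def check_docstrings(module: types.ModuleType) -> List[str]:
    """Return names of public functions in `module` that lack docstrings."""
    missing = []
    for name, obj in vars(module).items():
        if inspect.isfunction(obj) and not name.startswith("_"):
            if not inspect.getdoc(obj):
                missing.append(name)
    return missing

# Build a tiny throwaway module to demonstrate the check.
demo = types.ModuleType("demo")
exec(
    "def documented():\n"
    "    '''Has a docstring.'''\n"
    "def undocumented():\n"
    "    pass\n",
    demo.__dict__,
)
print(check_docstrings(demo))  # ['undocumented']
```

The same idea extends to type-hint coverage via `inspect.signature`, and external tools such as mypy can take the verification further.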

## 8. Example: Complete Well-Structured Script

```python
#!/usr/bin/env python3
"""Clinical Trial Statistical Analysis

This module provides functionality for:
- Loading trial data from multiple sources (Excel, CSV, database)
- Calculating response rates and survival statistics
- Performing statistical comparisons (chi-square, t-test, log-rank)
- Generating comprehensive reports with visualizations

Usage:
    uv run python trial_stats.py analyze --input trials.xlsx --output results/
    uv run python trial_stats.py compare --group1 Phase2 --group2 Phase3
    uv run python trial_stats.py report --results results/ --output report.md

Dependencies:
    - pandas >= 1.5.0
    - scipy >= 1.9.0
    - openpyxl >= 3.0.0
    - matplotlib >= 3.6.0
    - lifelines >= 0.27.0

Configuration:
    env.jsonc for database connection (optional)

Output:
    - JSON files with statistical results
    - Excel files with aggregated data
    - PNG files with visualizations
    - Markdown report with findings

Author: BioResearcher AI Agent
Date: 2024-01-15
Version: 1.0.0
"""

import sys
from typing import Dict, List, Any, Optional, Tuple
import pandas as pd
import numpy as np
from scipy import stats

# Constants
DEFAULT_ALPHA = 0.05
MIN_SAMPLE_SIZE = 10
SUPPORTED_FORMATS = ["json", "csv", "excel"]

# Private utility functions

def _validate_dataframe(df: pd.DataFrame, required_cols: List[str]) -> None:
    """Validate DataFrame has required columns.

    Args:
        df: DataFrame to validate
        required_cols: List of required column names

    Raises:
        ValueError: If required columns missing
    """
    missing = [col for col in required_cols if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

def _calculate_confidence_interval(
    data: np.ndarray,
    confidence: float = 0.95
) -> Tuple[float, float]:
    """Calculate confidence interval for data.

    Args:
        data: Array of values
        confidence: Confidence level (0.0 to 1.0)

    Returns:
        Tuple of (lower_bound, upper_bound)
    """
    mean = np.mean(data)
    sem = stats.sem(data)
    interval = sem * stats.t.ppf((1 + confidence) / 2, len(data) - 1)
    return (mean - interval, mean + interval)

# Main analysis functions

def load_trial_data(
    file_path: str,
    file_format: str = "excel"
) -> pd.DataFrame:
    """Load trial data from file.

    Args:
        file_path: Path to input file
        file_format: File format ("excel", "csv", "json")

    Returns:
        DataFrame with trial data

    Raises:
        FileNotFoundError: If file doesn't exist
        ValueError: If format unsupported or file invalid
    """
    if file_format not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {file_format}")

    try:
        if file_format == "excel":
            df = pd.read_excel(file_path)
        elif file_format == "csv":
            df = pd.read_csv(file_path)
        else:  # json
            df = pd.read_json(file_path)
    except FileNotFoundError:
        raise FileNotFoundError(f"File not found: {file_path}")
    except Exception as e:
        raise ValueError(f"Failed to load {file_path}: {e}")

    # Validate structure
    required = ["nct_id", "phase", "response_rate"]
    _validate_dataframe(df, required)

    return df

def compare_response_rates(
    df: pd.DataFrame,
    group_column: str = "phase",
    alpha: float = DEFAULT_ALPHA
) -> Dict[str, Any]:
    """Compare response rates across groups with statistical testing.

    Performs chi-square test for independence and calculates
    confidence intervals for each group.

    Args:
        df: DataFrame containing trial data
        group_column: Column to group by
        alpha: Significance level (0.0 to 1.0)

    Returns:
        Dictionary containing:
        {
            "group_stats": Statistics per group,
            "chi_square": Chi-square test results,
            "significant": Boolean for significance,
            "confidence_intervals": CI per group
        }

    Raises:
        ValueError: If insufficient data or invalid alpha
    """
    if not 0 < alpha < 1:
        raise ValueError(f"alpha must be 0-1, got {alpha}")

    if len(df) < MIN_SAMPLE_SIZE:
        raise ValueError(f"Insufficient data: {len(df)} < {MIN_SAMPLE_SIZE}")

    # Group statistics
    group_stats = df.groupby(group_column)["response_rate"].agg(
        ["mean", "std", "count", "median"]
    ).to_dict()

    # Chi-square test
    contingency = pd.crosstab(
        df[group_column],
        (df["response_rate"] > df["response_rate"].median()).astype(int)
    )
    chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

    # Confidence intervals
    ci_results = {}
    for group in df[group_column].unique():
        group_data = df[df[group_column] == group]["response_rate"].values
        ci_low, ci_high = _calculate_confidence_interval(group_data)
        ci_results[group] = {"lower": ci_low, "upper": ci_high}

    return {
        "group_stats": group_stats,
        "chi_square": {
            "statistic": chi2,
            "p_value": p_value,
|
|
887
|
+
"degrees_of_freedom": dof
|
|
888
|
+
},
|
|
889
|
+
"significant": p_value < alpha,
|
|
890
|
+
"confidence_intervals": ci_results
|
|
891
|
+
}
|
|
892
|
+
|
|
893
|
+
# CLI interface
|
|
894
|
+
|
|
895
|
+
def main():
|
|
896
|
+
"""Command-line interface for trial statistics."""
|
|
897
|
+
import argparse
|
|
898
|
+
|
|
899
|
+
parser = argparse.ArgumentParser(
|
|
900
|
+
description="Clinical trial statistical analysis"
|
|
901
|
+
)
|
|
902
|
+
parser.add_argument("command", choices=["analyze", "compare"])
|
|
903
|
+
parser.add_argument("--input", required=True, help="Input file path")
|
|
904
|
+
parser.add_argument("--output", required=True, help="Output directory")
|
|
905
|
+
parser.add_argument(
|
|
906
|
+
"--format",
|
|
907
|
+
choices=SUPPORTED_FORMATS,
|
|
908
|
+
default="excel",
|
|
909
|
+
help="Input file format"
|
|
910
|
+
)
|
|
911
|
+
|
|
912
|
+
args = parser.parse_args()
|
|
913
|
+
|
|
914
|
+
# Load data
|
|
915
|
+
df = load_trial_data(args.input, args.format)
|
|
916
|
+
|
|
917
|
+
# Execute command
|
|
918
|
+
if args.command == "compare":
|
|
919
|
+
results = compare_response_rates(df)
|
|
920
|
+
# Export results
|
|
921
|
+
import json
|
|
922
|
+
with open(f"{args.output}/comparison.json", 'w') as f:
|
|
923
|
+
json.dump(results, f, indent=2)
|
|
924
|
+
print(f"Results saved to {args.output}/comparison.json")
|
|
925
|
+
|
|
926
|
+
if __name__ == "__main__":
|
|
927
|
+
main()
|
|
928
|
+
```
|
|
929
|
+
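The statistical core of `compare_response_rates` (median-split chi-square test plus t-based confidence intervals) can also be exercised standalone. The sketch below is a self-contained illustration on synthetic data; the group labels, rates, and seed are invented for the example and are not part of the standard:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic trial data: two phases with clearly different response rates
df = pd.DataFrame({
    "phase": ["Phase 2"] * 30 + ["Phase 3"] * 30,
    "response_rate": np.concatenate([
        rng.normal(0.35, 0.05, 30),
        rng.normal(0.50, 0.05, 30),
    ]),
})

# Chi-square test of independence on a median split, as in the pattern above
contingency = pd.crosstab(
    df["phase"],
    (df["response_rate"] > df["response_rate"].median()).astype(int),
)
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

# 95% t-based confidence interval per group
for group, values in df.groupby("phase")["response_rate"]:
    sem = stats.sem(values)
    half_width = sem * stats.t.ppf(0.975, len(values) - 1)
    print(group, round(values.mean() - half_width, 3),
          round(values.mean() + half_width, 3))

print("significant:", p_value < 0.05)
```

With groups this well separated, the median split concentrates each phase in one cell of the 2x2 table, so the test comes out significant.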

---

## Summary

**Every Python script must have:**
1. ✅ Complete module docstring
2. ✅ Function docstrings with Args/Returns/Raises/Example
3. ✅ No code duplication (DRY principle)
4. ✅ Type hints for all functions
5. ✅ Input validation and error handling
6. ✅ Single responsibility per function
7. ✅ Configuration via parameters
8. ✅ Proper file organization

**Follow this pattern for all Python code to ensure maintainability, reusability, and professional quality.**
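As a compact reference, a minimal script skeleton that satisfies the checklist might look like the sketch below. The module name, column names, and `MIN_ROWS` threshold are placeholders for illustration, not part of the standard:

```python
"""Summarize response rates from a table of trial records.

Usage:
    python summarize.py
"""

from typing import Dict

import pandas as pd

MIN_ROWS = 3  # placeholder threshold; configure per analysis


def summarize_rates(df: pd.DataFrame, column: str = "response_rate") -> Dict[str, float]:
    """Compute summary statistics for one numeric column.

    Args:
        df: Input data.
        column: Numeric column to summarize.

    Returns:
        Dict with "mean", "std", and "n".

    Raises:
        ValueError: If the column is missing or there are too few rows.
    """
    if column not in df.columns:
        raise ValueError(f"Missing column: {column}")
    if len(df) < MIN_ROWS:
        raise ValueError(f"Insufficient data: {len(df)} < {MIN_ROWS}")
    return {
        "mean": float(df[column].mean()),
        "std": float(df[column].std()),
        "n": float(len(df)),
    }


if __name__ == "__main__":
    demo = pd.DataFrame({"response_rate": [0.30, 0.45, 0.50]})
    print(summarize_rates(demo))
```

Each checklist item maps onto a visible feature: module docstring, typed and documented function, validation before computation, configuration via the `column` parameter, and a `__main__` guard for the entry point.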