pytrials-v2 0.1.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pytrials_v2-0.1.0/.github/workflows/publish.yml +24 -0
- pytrials_v2-0.1.0/.gitignore +34 -0
- pytrials_v2-0.1.0/AGENTS.md +23 -0
- pytrials_v2-0.1.0/LICENSE +21 -0
- pytrials_v2-0.1.0/PKG-INFO +82 -0
- pytrials_v2-0.1.0/PROJECT_PLAN.md +397 -0
- pytrials_v2-0.1.0/README.md +41 -0
- pytrials_v2-0.1.0/demo/ctg_playground.py +119 -0
- pytrials_v2-0.1.0/pyproject.toml +58 -0
- pytrials_v2-0.1.0/src/pytrials/__init__.py +63 -0
- pytrials_v2-0.1.0/src/pytrials/client.py +87 -0
- pytrials_v2-0.1.0/src/pytrials/errors.py +28 -0
- pytrials_v2-0.1.0/src/pytrials/models/__init__.py +33 -0
- pytrials_v2-0.1.0/src/pytrials/models/enums.py +46 -0
- pytrials_v2-0.1.0/src/pytrials/models/search.py +17 -0
- pytrials_v2-0.1.0/src/pytrials/models/study.py +133 -0
- pytrials_v2-0.1.0/src/pytrials/modules/__init__.py +7 -0
- pytrials_v2-0.1.0/src/pytrials/modules/studies.py +81 -0
- pytrials_v2-0.1.0/tests/conftest.py +29 -0
- pytrials_v2-0.1.0/tests/fixtures/search_response.json +39 -0
- pytrials_v2-0.1.0/tests/fixtures/study_detail.json +43 -0
- pytrials_v2-0.1.0/tests/test_studies.py +86 -0
- pytrials_v2-0.1.0/uv.lock +661 -0
@@ -0,0 +1,24 @@
name: Publish to PyPI

# Publishes on a version tag (e.g. v0.1.0). Uses PyPI Trusted Publishing
# (OIDC), so no API token is stored in the repo. Configure the trusted
# publisher once on PyPI: project settings -> Publishing -> add a GitHub
# publisher for prahlaadr/pytrials-v2, workflow publish.yml, environment pypi.
on:
  push:
    tags:
      - "v*"

jobs:
  publish:
    runs-on: ubuntu-latest
    environment: pypi
    permissions:
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Build
        run: uv build
      - name: Publish
        uses: pypa/gh-action-pypi-publish@release/v1
@@ -0,0 +1,34 @@
# Python
__pycache__/
*.py[cod]
*.egg-info/
.eggs/
build/
dist/
# Envs
.venv/
venv/
.env
.env.local
# Tooling caches
.mypy_cache/
.ruff_cache/
.pytest_cache/
.coverage
htmlcov/
# uv
.uv/
# OS / editor
.DS_Store
.idea/
.vscode/
# docs build
site/

# marimo wasm build output
demo/dist/
.vercel/

# build artifacts
dist/
*.egg-info/
@@ -0,0 +1,23 @@
# AGENTS.md

Python SDK for the ClinicalTrials.gov API v2. Full design and roadmap in `PROJECT_PLAN.md`.

## Conventions
- Python 3.10+. Package/deps via `uv`. Build backend: hatch.
- Lint/format: ruff. Types: mypy strict. Tests: pytest + pytest-asyncio, mock httpx with respx.
- Never use em dashes in any output (code, comments, docs, commits). Use periods, commas, colons, or parentheses.
- Never commit secrets (this API needs none; it is public and unauthenticated).

## Core stack (from the plan)
httpx, pydantic v2, tenacity, respx, pytest, ruff, mypy, mkdocs-material, mike, hatch.

## Recommended libraries to speed the build
- **datamodel-code-generator**: generate the Pydantic v2 models directly from the ClinicalTrials.gov OpenAPI 3.0 spec instead of hand-writing the deeply nested hierarchy. Biggest time saver.
- **aiolimiter**: async token-bucket rate limiter (the 50 req/min cap), instead of hand-rolling.
- **hishel**: httpx-native HTTP caching, useful given the rate limit and large study payloads.
- **stamina**: modern retry wrapper over tenacity, cleaner ergonomics (optional; tenacity is fine).
- **Typer**: for the v1.0 CLI (`pytrials search ...`).
- **Polars** for the DataFrame layer (matches the house data stack: DuckDB then Polars, avoid pandas). Offer `.to_polars()` as primary, `.to_pandas()` as a thin convenience.

## Demo / frontend
A live interactive demo (clinical-trial search playground) is the artifact's "live" surface. Best fit: **Marimo** (reactive Python notebook, export to WASM static HTML, deploy to a pyaarproject subdomain). Alternatives: Streamlit or Gradio for a quick search UI, FastHTML for a fuller pure-Python web app.
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 Prahlaad R.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,82 @@
Metadata-Version: 2.4
Name: pytrials-v2
Version: 0.1.0
Summary: A modern, fully-typed Python SDK for the ClinicalTrials.gov API v2.
Project-URL: Homepage, https://github.com/raami/pytrials-v2
Project-URL: Repository, https://github.com/raami/pytrials-v2
Author: Prahlaad Ram
License: MIT License

Copyright (c) 2026 Prahlaad R.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
License-File: LICENSE
Keywords: api,clinical-trials,clinicaltrials,healthcare,sdk
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2
Description-Content-Type: text/markdown

# pytrials-v2

A modern, fully-typed Python SDK for the [ClinicalTrials.gov API v2](https://clinicaltrials.gov/data-api/api).

> Status: early development. See [PROJECT_PLAN.md](./PROJECT_PLAN.md) for the full design and roadmap.

## Why

The ClinicalTrials.gov API v2 (JSON, token pagination, OpenAPI 3.0) launched in 2024, and the old XML v1 API was retired. There is still no well-designed Python SDK for v2. pytrials-v2 aims to be the default: Pydantic-modeled, async-capable, with the ergonomic helpers clinical-trial data consumers actually need.

## What it offers

- Full Pydantic v2 models for every API response (real autocomplete, no dict-digging)
- A validating QueryBuilder that catches bad status, phase, and sort values before the request
- Async auto-pagination over `pageToken`
- DataFrame integration that flattens the nested study structure for analysis
- Date normalization across the API's inconsistent formats
- Built-in rate limiting (50 req/min) and retry with backoff

## Quickstart (planned API)

```python
from pytrials import ClinicalTrials

ctg = ClinicalTrials()

results = ctg.studies.search(condition="breast cancer", status=["RECRUITING"], phase=["PHASE3"])
study = ctg.studies.get("NCT04852770")
df = ctg.studies.search(condition="diabetes", status=["RECRUITING"]).to_dataframe()
```

## Roadmap

- **v0.1.0 Core**: client, search/get, core models, error handling, PyPI publish
- **v0.2.0 Ergonomics**: QueryBuilder, async paginator, stats endpoints, rate limiting
- **v0.3.0 Data science**: DataFrame integration, docs site, 90%+ coverage
- **v1.0.0 Stable**: full results-section models, CLI, notebook examples

## License

MIT
@@ -0,0 +1,397 @@
# pytrials-v2: ClinicalTrials.gov API v2 Python SDK

## Project overview

A modern, fully-typed Python SDK for the ClinicalTrials.gov API v2. No authentication required (public API). The goal is to become the default Python library for anyone working with clinical trial data programmatically.

**Package name:** `pytrials-v2`
**PyPI:** `pip install pytrials-v2`
**GitHub:** `github.com/raami/pytrials-v2`
**License:** MIT
**Python:** 3.10+

---

## Why this exists

The ClinicalTrials.gov API v2 launched in March 2024 as a complete rewrite: JSON responses, token-based pagination, OpenAPI 3.0 spec, enumerated values instead of free text. The old v1 API (XML-based) was retired in June 2024. There is currently no proper Python SDK for v2. What exists:

- `pytrials` on PyPI: partially updated for v2, minimal type coverage, no async, no pagination helper, no query builder
- `clinical-trials-api` on npm: JavaScript, toy-level
- MCP servers (cyanheads, etc.): designed for LLM tool use, not developer integration
- Raw `requests.get()` examples in blog posts

The gap: a well-designed, Pydantic-modeled, async-capable Python SDK with ergonomic helpers for the patterns that clinical trial data consumers actually need.

---

## ClinicalTrials.gov API v2 endpoints

Base URL: `https://clinicaltrials.gov/api/v2`

| Endpoint | Method | Description |
|---|---|---|
| `/studies` | GET | Search studies with query params and filters |
| `/studies/{nctId}` | GET | Get full study record by NCT ID |
| `/stats/size` | GET | Get total study count for a query |
| `/stats/fieldValues` | GET | Get valid values for a field with counts |
| `/stats/listFields` | GET | Browse the study data model field tree |
| `/version` | GET | API version and data timestamp |
| `/studies/metadata` | GET | Study data structure metadata |

**No auth required.** Rate limit is approximately 50 requests per minute per IP.

---

## SDK architecture

### Layer 1: Public API surface

```python
from pytrials import ClinicalTrials

ctg = ClinicalTrials()

# Search studies
results = ctg.studies.search(
    condition="breast cancer",
    status=["RECRUITING"],
    phase=["PHASE3"],
    page_size=100,
    sort="LastUpdatePostDate:desc"
)

# Get single study
study = ctg.studies.get("NCT04852770")

# Count studies matching a query
count = ctg.stats.size(condition="diabetes", status=["RECRUITING"])

# Get valid values for a field
statuses = ctg.stats.field_values("OverallStatus")

# API version info
version = ctg.version()
```

#### Module namespaces

**ctg.studies**
- `search(**kwargs)` -- search with query params, returns `StudySearchResult`
- `get(nct_id)` -- fetch single study, returns `Study`
- `search_all(**kwargs)` -- async generator, auto-paginates through all results
- `bulk_get(nct_ids: list)` -- fetch multiple studies by NCT ID list

**ctg.stats**
- `size(**kwargs)` -- total count for a query, returns `int`
- `field_values(field, **kwargs)` -- enum values with counts, returns `list[FieldValue]`
- `list_fields(parent=None)` -- browse study data model tree

**ctg.metadata**
- `fields()` -- available fields for field selection
- `search_areas()` -- valid search area definitions

**ctg.version**
- `__call__()` -- returns `VersionInfo` with api_version and data_timestamp

### Layer 2: QueryBuilder (fluent interface)

```python
from pytrials import ClinicalTrials, Query

ctg = ClinicalTrials()

# Fluent query construction
query = (
    Query()
    .condition("non-small cell lung cancer")
    .intervention("pembrolizumab")
    .sponsor("Merck")
    .location("United States")
    .status("RECRUITING", "NOT_YET_RECRUITING")
    .phase("PHASE3")
    .sort("LastUpdatePostDate:desc")
)

results = ctg.studies.search(query, page_size=50)
```

The `Query` object validates enum values at construction time (status, phase) and raises `InvalidQueryError` before hitting the API.
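
As a rough illustration of the construction-time validation described above, the sketch below shows one way a fluent builder might reject bad enum values before any request is made. It is not the package's implementation: the enum members listed are only a subset of the real values, and `_coerce` and `to_dict` are hypothetical helper names introduced for this example.

```python
from __future__ import annotations

from enum import Enum


class InvalidQueryError(ValueError):
    """Raised when a query holds a value the API would reject."""


class OverallStatus(str, Enum):
    # Subset of the real status values, for illustration only.
    RECRUITING = "RECRUITING"
    NOT_YET_RECRUITING = "NOT_YET_RECRUITING"
    COMPLETED = "COMPLETED"


class Phase(str, Enum):
    # Subset of the real phase values, for illustration only.
    PHASE1 = "PHASE1"
    PHASE2 = "PHASE2"
    PHASE3 = "PHASE3"


def _coerce(enum_cls, values):
    """Turn raw strings into enum members, failing fast on unknown values."""
    members = []
    for value in values:
        try:
            members.append(enum_cls(value))
        except ValueError:
            allowed = ", ".join(m.value for m in enum_cls)
            raise InvalidQueryError(
                f"{value!r} is not a valid {enum_cls.__name__}; expected one of: {allowed}"
            ) from None
    return members


class Query:
    """Fluent builder: every setter validates, then returns self for chaining."""

    def __init__(self) -> None:
        self._condition: str | None = None
        self._status: list[OverallStatus] = []
        self._phase: list[Phase] = []

    def condition(self, text: str) -> Query:
        self._condition = text
        return self

    def status(self, *values: str) -> Query:
        self._status = _coerce(OverallStatus, values)
        return self

    def phase(self, *values: str) -> Query:
        self._phase = _coerce(Phase, values)
        return self

    def to_dict(self) -> dict:
        # Hypothetical export used here only to inspect the built query.
        return {
            "condition": self._condition,
            "status": [s.value for s in self._status],
            "phase": [p.value for p in self._phase],
        }
```

Because validation happens inside each setter, a typo such as `Query().status("recruiting")` fails at the line that introduced it rather than as a 400 response later.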

### Layer 3: Pydantic v2 models

Every API response is a validated Pydantic model. The study data structure is deeply nested, so models mirror the hierarchy:

```
Study
  protocolSection
    identificationModule (nctId, briefTitle, officialTitle, organization)
    statusModule (overallStatus, startDateStruct, completionDateStruct)
    sponsorCollaboratorsModule (leadSponsor, collaborators)
    descriptionModule (briefSummary, detailedDescription)
    conditionsModule (conditions, keywords)
    designModule (studyType, phases, enrollmentInfo, designInfo)
    armsInterventionsModule (armGroups, interventions)
    outcomesModule (primaryOutcomes, secondaryOutcomes)
    eligibilityModule (criteria, healthyVolunteers, sex, minimumAge, maximumAge)
    contactsLocationsModule (overallOfficials, locations)
    referencesModule (references, seeAlsoLinks)
  derivedSection
    miscInfoModule
    conditionBrowseModule (meshTerms)
    interventionBrowseModule (meshTerms)
  resultsSection (when hasResults=True)
    participantFlowModule
    baselineCharacteristicsModule
    outcomeMeasuresModule
    adverseEventsModule
  hasResults: bool
```

Key design decisions:
- All fields are `Optional` with sensible defaults (the API has many nullable fields)
- Date fields use a custom `CTGDate` type that handles the inconsistent formats ("2024-01-15", "January 2024", "January 15, 2024")
- Enum fields (status, phase, study type) are Python enums with validation
- `model_config = ConfigDict(extra="allow")` so new API fields don't break the SDK
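
To make the date-handling decision concrete, here is a minimal sketch of parsing the three formats quoted above while remembering how precise each input was. It uses only the standard library; `ParsedDate` and `parse_ctg_date` are hypothetical names for this example and are not the planned `CTGDate` type itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class ParsedDate:
    value: date      # month-only inputs are pinned to the first of the month
    precision: str   # "day" or "month"


# Formats taken from the examples in the plan, tried in order.
# Note: %B matches month names in the current locale (English by default).
_FORMATS = (
    ("%Y-%m-%d", "day"),    # 2024-01-15
    ("%B %d, %Y", "day"),   # January 15, 2024
    ("%B %Y", "month"),     # January 2024
)


def parse_ctg_date(raw: str) -> ParsedDate:
    """Try each known format and report which precision matched."""
    text = raw.strip()
    for fmt, precision in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return ParsedDate(parsed.date(), precision)
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Keeping the precision next to the value lets downstream code distinguish a true "January 1" from a month-only date that was merely pinned to the first day.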

### Layer 4: Core HTTP client

```python
# Internal, not public API
class CTGClient:
    base_url = "https://clinicaltrials.gov/api/v2"

    # httpx async client with:
    # - Automatic retry with exponential backoff (tenacity)
    # - Rate limiting (50 req/min token bucket)
    # - Configurable timeout (default 30s)
    # - Response validation via Pydantic
    # - Structured CTGError on 4xx/5xx
```
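
The token-bucket bullet in the client outline above could be realized roughly as follows. This is a standalone sketch using only the standard library, not the client's actual code; the `TokenBucket` class, its method names, and the injectable `clock` parameter are assumptions made so the behavior can be checked without sleeping.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at a steady rate."""

    def __init__(self, rate_per_minute: float = 50, capacity: int = 50, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_second = rate_per_minute / 60.0
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def _refill(self) -> None:
        # Credit tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)

    def try_acquire(self) -> bool:
        """Consume one token if available; never blocks."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False

    def seconds_until_available(self) -> float:
        """How long a caller should wait before the next token exists."""
        self._refill()
        missing = max(0.0, 1.0 - self._tokens)
        return missing / self.refill_per_second
```

An async client could wrap this by sleeping for `seconds_until_available()` whenever `try_acquire()` returns `False`, which keeps the limiter itself free of any event-loop dependency.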

### Layer 5: Pagination

```python
# Auto-paginate through all results
async for study in ctg.studies.search_all(condition="cancer"):
    process(study)

# Or collect all at once (careful with large result sets)
all_studies = await ctg.studies.search_all(
    condition="cancer",
    status=["COMPLETED"]
).collect()
```

The paginator handles `pageToken` transparently, respects rate limits between pages, and yields `Study` objects one at a time via async generator.

### Layer 6: pandas integration

```python
# Direct DataFrame output
df = ctg.studies.search(
    condition="diabetes",
    status=["RECRUITING"],
    page_size=100
).to_dataframe()

# Flattened columns: nct_id, brief_title, overall_status,
# lead_sponsor, phase, enrollment, start_date, ...
```

The `.to_dataframe()` method flattens the nested study structure into a tabular format suitable for analysis. Users can specify which fields to include.
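
One plausible way to do the flattening is a table of column names mapped to paths into the nested record, as sketched below. The paths follow the module hierarchy listed in this plan, but the leaf keys (`name`, `count`, `date`), the `COLUMNS` table, and the `flatten_study` helper are assumptions for illustration, not the package's confirmed layout.

```python
from typing import Any

# Output column -> path into the nested study record (leaf keys are assumed).
COLUMNS: dict[str, tuple[str, ...]] = {
    "nct_id": ("protocolSection", "identificationModule", "nctId"),
    "brief_title": ("protocolSection", "identificationModule", "briefTitle"),
    "overall_status": ("protocolSection", "statusModule", "overallStatus"),
    "lead_sponsor": ("protocolSection", "sponsorCollaboratorsModule", "leadSponsor", "name"),
    "phase": ("protocolSection", "designModule", "phases"),
    "enrollment": ("protocolSection", "designModule", "enrollmentInfo", "count"),
    "start_date": ("protocolSection", "statusModule", "startDateStruct", "date"),
}


def _dig(record: Any, path: tuple[str, ...]) -> Any:
    """Follow a key path, returning None as soon as any level is missing."""
    current = record
    for key in path:
        if not isinstance(current, dict):
            return None
        current = current.get(key)
    return current


def flatten_study(study: dict[str, Any], columns: dict[str, tuple[str, ...]] = COLUMNS) -> dict[str, Any]:
    """Produce one flat row; list values are joined so each cell stays scalar."""
    row: dict[str, Any] = {}
    for name, path in columns.items():
        value = _dig(study, path)
        if isinstance(value, list):
            value = "|".join(str(v) for v in value)
        row[name] = value
    return row
```

A list of such rows can be handed directly to a DataFrame constructor, and passing a smaller `columns` mapping is one way to support user-selected fields.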

---

## Project structure

```
pytrials-v2/
  src/
    pytrials/
      __init__.py          # ClinicalTrials client, Query, enums
      client.py            # CTGClient (httpx, retry, rate limit)
      models/
        __init__.py
        study.py           # Study, ProtocolSection, etc.
        search.py          # StudySearchResult, pagination
        stats.py           # FieldValue, VersionInfo
        enums.py           # OverallStatus, Phase, StudyType
        dates.py           # CTGDate custom type
      modules/
        __init__.py
        studies.py         # StudiesModule (search, get, search_all)
        stats.py           # StatsModule (size, field_values)
        metadata.py        # MetadataModule (fields, search_areas)
      query.py             # QueryBuilder fluent interface
      pagination.py        # AsyncPaginator generator
      errors.py            # CTGError, RateLimitError, NotFoundError
      pandas_ext.py        # .to_dataframe() integration
  tests/
    conftest.py            # respx fixtures, sample responses
    test_client.py
    test_studies.py
    test_stats.py
    test_query.py
    test_pagination.py
    test_models.py
    fixtures/
      search_response.json
      study_detail.json
      stats_size.json
  docs/
    index.md
    quickstart.md
    api-reference/
      studies.md
      stats.md
      query-builder.md
      models.md
    guides/
      pagination.md
      pandas-integration.md
      common-patterns.md   # Competitive intel, site selection, patient matching
  pyproject.toml
  README.md
  LICENSE
  CHANGELOG.md
  .github/
    workflows/
      ci.yml               # pytest, mypy, ruff, coverage
      publish.yml          # PyPI publish on tag
```

---

## Tech stack

| Tool | Purpose |
|---|---|
| httpx | Async HTTP client (better than requests for async) |
| pydantic v2 | Response validation and type safety |
| tenacity | Retry with exponential backoff |
| respx | Mock httpx in tests |
| pytest + pytest-asyncio | Test framework |
| ruff | Linting and formatting |
| mypy (strict) | Static type checking |
| mkdocs-material | Documentation site |
| mike | Docs versioning |
| hatch | Build backend and environment management |

---

## Release roadmap

### v0.1.0: Core (weeks 1-2)

Ship a working SDK that covers the basic use cases.

- [ ] `ClinicalTrials` client class with httpx
- [ ] `ctg.studies.search()` with all query params
- [ ] `ctg.studies.get()` single study by NCT ID
- [ ] Pydantic models for Study, ProtocolSection, and key submodules
- [ ] Enum types for OverallStatus, Phase, StudyType
- [ ] Basic error handling (CTGError with status_code, message)
- [ ] pytest suite with respx mocking
- [ ] README with quickstart
- [ ] PyPI publish via GitHub Actions

### v0.2.0: Ergonomics (weeks 3-4)

Make the SDK pleasant to use for real workflows.

- [ ] QueryBuilder fluent interface with validation
- [ ] AsyncPaginator (async generator for search_all)
- [ ] `ctg.stats.size()` and `ctg.stats.field_values()`
- [ ] `ctg.version()` endpoint
- [ ] Retry with exponential backoff (tenacity)
- [ ] Rate limiter (token bucket, 50 req/min)
- [ ] CTGDate custom type for inconsistent date formats
- [ ] `bulk_get()` for multiple NCT IDs

### v0.3.0: Data science (weeks 5-6)

Make it useful for analysts and researchers.

- [ ] `.to_dataframe()` pandas integration
- [ ] CSV export option (format=csv passthrough)
- [ ] `ctg.metadata.fields()` and `search_areas()`
- [ ] mkdocs-material documentation site
- [ ] "Common patterns" guide (competitive intel, site selection, patient matching)
- [ ] mypy strict mode passing
- [ ] 90%+ test coverage

### v1.0.0: Stable (week 8+)

- [ ] Full model coverage for resultsSection (outcomes, adverse events, participant flow)
- [ ] CLI tool (`pytrials search --condition "diabetes" --status RECRUITING`)
- [ ] Jupyter notebook examples
- [ ] Docs site deployed (GitHub Pages or Vercel)
- [ ] Community feedback incorporated

---

## Competitive differentiation

What this SDK does that nothing else offers:

1. **Full Pydantic models.** Every field in the API response is typed. IDE autocomplete works everywhere. `study.protocol_section.eligibility_module.minimum_age` not `study["protocolSection"]["eligibilityModule"]["minimumAge"]`.

2. **QueryBuilder with validation.** Catches invalid status values, phase values, and sort options before the request is sent. No more 400 errors from typos.

3. **Auto-pagination.** `async for study in ctg.studies.search_all(...)` handles pageToken transparently. Nobody should write a pagination loop manually.

4. **pandas-native.** `.to_dataframe()` flattens the deeply nested study structure into analysis-ready columns. This is what 80% of users actually want.

5. **Date normalization.** The API returns dates in at least three formats. The SDK normalizes them into Python `date` objects.

6. **Rate limit awareness.** Built-in token bucket respects the 50 req/min limit. No more 429 errors during bulk operations.

7. **Domain expertise in the API design.** Search patterns (condition + intervention + sponsor + location + status + phase) are first-class, not afterthoughts. The QueryBuilder reflects how regulatory professionals, CROs, and biotech analysts actually query this data.

---

## Example use cases to document

### Competitive intelligence
Search what trials a specific sponsor is running, filter by phase and status, export to DataFrame for analysis.

### Clinical site selection
Find facilities running trials for a condition in a geography. The locations data in contactsLocationsModule is rich enough for this.

### Patient matching
Search recruiting trials by condition, location, age range, and healthy volunteer status. This is the patient-facing use case.

### Regulatory landscape
Count trials by phase and status for a therapeutic area. Use stats endpoints for aggregate views.

### Drug pipeline tracking
Track all trials for a specific intervention across phases. Use sort by last update to catch new filings.

---

## Marketing and distribution

- PyPI package with clear README
- Blog post on dev.to or Medium: "Building the Python SDK ClinicalTrials.gov should have shipped"
- Post in r/bioinformatics, r/clinicalresearch, r/Python
- LinkedIn post (leveraging your existing healthcare audience)
- Submit to awesome-python and awesome-healthcare lists
- Present at OOP Data Camp (you're already going)
- Cross-reference from OpenTrialGraph

---

## Key API quirks to handle

1. **Date inconsistency.** Some dates are "2024-01-15", others "January 2024", others "January 15, 2024". The SDK should normalize to `datetime.date` or a `CTGDate` that preserves precision (year-only vs. full date).

2. **Nullable arrays.** conditions, interventions, locations, collaborators can all be null or empty arrays. Models must handle both.

3. **Large responses.** A single study with results can be 200KB+ of JSON. The resultsSection (outcomes, adverse events, participant flow, baseline) is massive. Consider lazy loading or optional field selection.

4. **pageSize default is 10.** Most users want more. The SDK should default to 100 or let users set a client-level default.

5. **Enumerated values are case-sensitive.** RECRUITING works, recruiting doesn't. The SDK should handle case normalization.

6. **CSV format returns flat columns.** The JSON and CSV response schemas are different. The SDK should abstract this.
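
For the case-sensitivity quirk in item 5, one possible approach is to let the enum itself accept loosely written input through the standard `_missing_` hook. The sketch below assumes that policy; the member list is a subset, and the treatment of spaces and hyphens is an extra convenience not stated in the plan.

```python
from enum import Enum


class OverallStatus(str, Enum):
    # Subset of the real status values, for illustration only.
    RECRUITING = "RECRUITING"
    NOT_YET_RECRUITING = "NOT_YET_RECRUITING"
    COMPLETED = "COMPLETED"

    @classmethod
    def _missing_(cls, value):
        # Called only when the exact-value lookup fails.
        if isinstance(value, str):
            normalized = value.strip().upper().replace(" ", "_").replace("-", "_")
            for member in cls:
                if member.value == normalized:
                    return member
        return None  # lets Enum raise its usual ValueError
```

With this hook, `OverallStatus("recruiting")` resolves to `OverallStatus.RECRUITING`, while unknown strings still raise `ValueError`, so genuine typos are not silently accepted.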
@@ -0,0 +1,41 @@
# pytrials-v2

A modern, fully-typed Python SDK for the [ClinicalTrials.gov API v2](https://clinicaltrials.gov/data-api/api).

> Status: early development. See [PROJECT_PLAN.md](./PROJECT_PLAN.md) for the full design and roadmap.

## Why

The ClinicalTrials.gov API v2 (JSON, token pagination, OpenAPI 3.0) launched in 2024, and the old XML v1 API was retired. There is still no well-designed Python SDK for v2. pytrials-v2 aims to be the default: Pydantic-modeled, async-capable, with the ergonomic helpers clinical-trial data consumers actually need.

## What it offers

- Full Pydantic v2 models for every API response (real autocomplete, no dict-digging)
- A validating QueryBuilder that catches bad status, phase, and sort values before the request
- Async auto-pagination over `pageToken`
- DataFrame integration that flattens the nested study structure for analysis
- Date normalization across the API's inconsistent formats
- Built-in rate limiting (50 req/min) and retry with backoff

## Quickstart (planned API)

```python
from pytrials import ClinicalTrials

ctg = ClinicalTrials()

results = ctg.studies.search(condition="breast cancer", status=["RECRUITING"], phase=["PHASE3"])
study = ctg.studies.get("NCT04852770")
df = ctg.studies.search(condition="diabetes", status=["RECRUITING"]).to_dataframe()
```

## Roadmap

- **v0.1.0 Core**: client, search/get, core models, error handling, PyPI publish
- **v0.2.0 Ergonomics**: QueryBuilder, async paginator, stats endpoints, rate limiting
- **v0.3.0 Data science**: DataFrame integration, docs site, 90%+ coverage
- **v1.0.0 Stable**: full results-section models, CLI, notebook examples

## License

MIT