@nahisaho/satori 0.9.0 → 0.11.0
- package/README.md +188 -39
- package/package.json +1 -1
- package/src/.github/skills/scientific-clinical-trials-analytics/SKILL.md +340 -0
- package/src/.github/skills/scientific-computational-materials/SKILL.md +353 -0
- package/src/.github/skills/scientific-environmental-ecology/SKILL.md +295 -0
- package/src/.github/skills/scientific-epidemiology-public-health/SKILL.md +332 -0
- package/src/.github/skills/scientific-epigenomics-chromatin/SKILL.md +567 -0
- package/src/.github/skills/scientific-gene-expression-transcriptomics/SKILL.md +330 -0
- package/src/.github/skills/scientific-immunoinformatics/SKILL.md +341 -0
- package/src/.github/skills/scientific-infectious-disease/SKILL.md +342 -0
- package/src/.github/skills/scientific-lab-data-management/SKILL.md +334 -0
- package/src/.github/skills/scientific-microbiome-metagenomics/SKILL.md +349 -0
- package/src/.github/skills/scientific-neuroscience-electrophysiology/SKILL.md +400 -0
- package/src/.github/skills/scientific-pharmacogenomics/SKILL.md +342 -0
- package/src/.github/skills/scientific-population-genetics/SKILL.md +336 -0
- package/src/.github/skills/scientific-proteomics-mass-spectrometry/SKILL.md +401 -0
- package/src/.github/skills/scientific-regulatory-science/SKILL.md +256 -0
- package/src/.github/skills/scientific-scientific-schematics/SKILL.md +336 -0
- package/src/.github/skills/scientific-single-cell-genomics/SKILL.md +361 -0
- package/src/.github/skills/scientific-spatial-transcriptomics/SKILL.md +281 -0
- package/src/.github/skills/scientific-systems-biology/SKILL.md +310 -0
- package/src/.github/skills/scientific-text-mining-nlp/SKILL.md +358 -0
@@ -0,0 +1,330 @@
---
name: scientific-gene-expression-transcriptomics
description: |
  Gene expression and transcriptomics analysis skill. Covers retrieval and
  preprocessing of public datasets from GEO (Gene Expression Omnibus),
  differential expression analysis with DESeq2 (PyDESeq2),
  GTEx tissue expression reference and eQTL analysis, integrated
  Expression Atlas (EBI GXA) queries, gene set enrichment analysis (GSEA),
  and a standard analysis pipeline for bulk RNA-seq count data.
---

# Scientific Gene Expression & Transcriptomics

Provides an integrated transcriptomics pipeline for bulk RNA-seq and microarray
gene expression data: GEO dataset retrieval → preprocessing → differential
expression → GSEA → tissue expression reference.

## When to Use

- When retrieving and preprocessing bulk RNA-seq or microarray datasets from GEO
- When differentially expressed gene (DEG) analysis with DESeq2 is needed
- When querying GTEx tissue expression profiles or eQTL data
- When running gene set enrichment analysis (GSEA/ORA)
- When searching Expression Atlas for baseline or differential expression experiments

---

## Quick Start

## 1. GEO dataset retrieval

```python
import os

import GEOparse
import pandas as pd


def fetch_geo_dataset(accession, output_dir="data/geo"):
    """
    Fetch and preprocess a GEO (Gene Expression Omnibus) dataset.

    GEO ID formats:
    - GSE: Series (expression dataset)
    - GPL: Platform (array/sequencer definition)
    - GSM: Sample (individual sample)
    - GDS: Dataset (curated)

    The body below reads gse.gsms and gse.gpls, so it expects a GSE accession.
    """
    os.makedirs(output_dir, exist_ok=True)

    gse = GEOparse.get_GEO(geo=accession, destdir=output_dir)

    print(f"  GEO Accession: {accession}")
    print(f"  Title: {gse.metadata['title'][0]}")
    print(f"  Platform: {list(gse.gpls.keys())}")
    print(f"  Samples: {len(gse.gsms)}")
    print(f"  Type: {gse.metadata.get('type', ['unknown'])}")

    # Extract sample metadata. A sample can carry several characteristics_ch1
    # entries, so join them instead of keeping only the first one.
    keep = ("title", "source_name_ch1", "characteristics_ch1")
    metadata = []
    for gsm_name, gsm in gse.gsms.items():
        meta = {"sample_id": gsm_name}
        meta.update({k: "; ".join(map(str, v)) if v else None
                     for k, v in gsm.metadata.items()
                     if k in keep})
        metadata.append(meta)

    metadata_df = pd.DataFrame(metadata)

    # Build the expression matrix from the per-sample tables. Rows are
    # platform feature IDs (ID_REF), which are probes on microarrays.
    # RNA-seq series often ship counts as supplementary files instead,
    # in which case the sample tables may be empty.
    pivot_df = gse.pivot_samples("VALUE")
    print(f"  Expression matrix: {pivot_df.shape[0]} features × {pivot_df.shape[1]} samples")

    return gse, metadata_df, pivot_df
```
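
GEO stores sample annotations in `characteristics_ch1` as free-text `key: value` strings, so they usually need splitting before they can serve as design factors for the differential expression step. The following is a minimal sketch under that assumption; `parse_characteristics` is a hypothetical helper for illustration and is not part of GEOparse.

```python
def parse_characteristics(entries):
    """Split GEO-style 'key: value' characteristic strings into a dict."""
    parsed = {}
    for entry in entries:
        key, sep, value = entry.partition(":")
        if not sep:
            # No colon found: keep the raw text under a generic key
            parsed.setdefault("unparsed", []).append(entry.strip())
            continue
        # Normalize the key so it can be used as a column name
        parsed[key.strip().lower().replace(" ", "_")] = value.strip()
    return parsed


example = ["tissue: liver", "disease state: control", "age: 54"]
print(parse_characteristics(example))
```

Applied per sample, the resulting dicts can be collected into a DataFrame whose columns (for example `disease_state`) then act as the design factor.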

## 2. DESeq2 differential expression analysis (PyDESeq2)

```python
import os

import numpy as np
import pandas as pd


def deseq2_differential_expression(count_matrix, metadata, design_factor,
                                   contrast=None, alpha=0.05,
                                   lfc_threshold=1.0):
    """
    Differential expression analysis pipeline using PyDESeq2.

    1. Count matrix input (samples × genes, as PyDESeq2 expects;
       transpose a genes × samples matrix first)
    2. Size-factor normalization (median of ratios)
    3. Dispersion estimation (shrinkage)
    4. GLM fitting (negative binomial distribution)
    5. Wald test
    6. LFC shrinkage (apeglm)
    7. FDR correction (Benjamini-Hochberg)

    contrast: [factor, tested_level, reference_level],
        e.g. ["condition", "treated", "control"]
    """
    from pydeseq2.dds import DeseqDataSet
    from pydeseq2.ds import DeseqStats

    # Build the DeseqDataSet
    dds = DeseqDataSet(
        counts=count_matrix,
        metadata=metadata,
        design_factors=design_factor,
    )

    # Normalization + dispersion estimation + GLM fitting
    dds.deseq2()

    # Wald test and multiple-testing correction
    stat_res = DeseqStats(dds, contrast=contrast, alpha=alpha)
    stat_res.summary()

    results_df = stat_res.results_df.copy()  # unshrunk estimates

    # LFC shrinkage. lfc_shrink() takes a coefficient name (a string), not the
    # contrast list. The "<factor>_<tested>_vs_<reference>" pattern is an
    # assumption that matches PyDESeq2 releases accepting `design_factors`;
    # inspect dds.varm["LFC"].columns if your version names it differently.
    # Shrinkage is skipped when no contrast is given.
    if contrast is not None:
        coeff = f"{contrast[0]}_{contrast[1]}_vs_{contrast[2]}"
        stat_res.lfc_shrink(coeff=coeff)
        results_df["log2FoldChange_shrunk"] = stat_res.results_df["log2FoldChange"]

    # Filter on adjusted p-value and unshrunk fold change
    sig = results_df[
        (results_df["padj"] < alpha) &
        (results_df["log2FoldChange"].abs() > lfc_threshold)
    ]

    sig_up = sig[sig["log2FoldChange"] > 0]
    sig_down = sig[sig["log2FoldChange"] < 0]

    print("  DESeq2 results:")
    print(f"    Total genes tested: {len(results_df)}")
    print(f"    Significant (FDR < {alpha}, |log2FC| > {lfc_threshold}):")
    print(f"      UP: {len(sig_up)}")
    print(f"      DOWN: {len(sig_down)}")

    return results_df, sig


def generate_volcano_plot(results_df, alpha=0.05, lfc_threshold=1.0,
                          output_file="figures/volcano_rnaseq.png"):
    """
    Generate a volcano plot.
    """
    import matplotlib.pyplot as plt

    # Create the output directory if needed so savefig does not fail
    out_dir = os.path.dirname(output_file)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)

    fig, ax = plt.subplots(figsize=(8, 6))

    # Work on local series so the caller's DataFrame is not modified
    lfc = results_df["log2FoldChange"]
    padj = results_df["padj"]
    neg_log10_padj = -np.log10(padj.clip(lower=1e-300))

    # Color coding: red = up, blue = down, gray = not significant
    colors = np.select(
        [(padj < alpha) & (lfc > lfc_threshold),
         (padj < alpha) & (lfc < -lfc_threshold)],
        ["red", "blue"],
        default="gray",
    )

    ax.scatter(lfc, neg_log10_padj, c=colors, alpha=0.5, s=5)
    ax.axhline(-np.log10(alpha), color="gray", linestyle="--", lw=0.5)
    ax.axvline(lfc_threshold, color="gray", linestyle="--", lw=0.5)
    ax.axvline(-lfc_threshold, color="gray", linestyle="--", lw=0.5)
    ax.set_xlabel("log2 Fold Change")
    ax.set_ylabel("-log10(adjusted p-value)")
    ax.set_title("Volcano Plot: Differential Expression")
    plt.tight_layout()
    plt.savefig(output_file, dpi=300)
    plt.close(fig)

    return output_file
```
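
Step 7 of the pipeline, Benjamini-Hochberg FDR correction, is what produces the `padj` column that both functions filter on. As a rough illustration of that adjustment only, here is a dependency-free sketch; `benjamini_hochberg` is a hypothetical name, and the sketch omits the independent filtering that DESeq2 applies before adjusting.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.20]
for p, q in zip(pvals, benjamini_hochberg(pvals)):
    print(f"p={p:<6} padj={q:.5f}")
```

Note how the third and fourth genes end up with the same adjusted value: the raw ratio for rank 3 exceeds the one for rank 4, so the monotonicity step caps it.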

## 3. GTEx tissue expression and eQTL queries

```python
def query_gtex_expression(gene_name, tissue=None):
    """
    Query GTEx (Genotype-Tissue Expression) tissue expression profiles.

    GTEx v8: 54 tissues, 948 donors, 17,382 samples.
    Expression values are TPM (Transcripts Per Million).

    Placeholder: echoes the query parameters and makes no API request.
    """
    print(f"  GTEx gene expression query: {gene_name}")
    if tissue:
        print(f"    Tissue: {tissue}")
    else:
        print("    All tissues (54 tissue sites)")

    return {"gene": gene_name, "tissue": tissue}


def query_gtex_eqtl(gene_name, tissue, pvalue_threshold=1e-5):
    """
    Query GTEx eQTLs (expression Quantitative Trait Loci).

    eQTL = a genetic variant that affects gene expression level
    - cis-eQTL: variant within ±1 Mb of the gene
    - trans-eQTL: variant distant from the gene

    Placeholder: echoes the query parameters and makes no API request.
    """
    print(f"  GTEx eQTL query: gene={gene_name}, tissue={tissue}")
    print(f"    P-value threshold: {pvalue_threshold}")
    print("    Types: cis-eQTL (primary), trans-eQTL")

    return {"gene": gene_name, "tissue": tissue}
```
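
The cis/trans distinction in the eQTL docstring comes down to a distance check. A minimal sketch follows, assuming the ±1 Mb window is measured from the gene's transcription start site (TSS); `classify_eqtl` and `CIS_WINDOW_BP` are hypothetical names for illustration, not GTEx API identifiers.

```python
CIS_WINDOW_BP = 1_000_000  # ±1 Mb window, per the cis-eQTL definition


def classify_eqtl(variant_chrom, variant_pos, gene_chrom, gene_tss,
                  window=CIS_WINDOW_BP):
    """Label a variant-gene pair as 'cis' or 'trans' by genomic distance."""
    # A variant on another chromosome can never fall inside the cis window
    if variant_chrom != gene_chrom:
        return "trans"
    return "cis" if abs(variant_pos - gene_tss) <= window else "trans"


pairs = [
    ("chr1", 1_500_000, "chr1", 1_000_000),   # 0.5 Mb away
    ("chr1", 2_500_001, "chr1", 1_000_000),   # just over 1.5 Mb away
    ("chr2", 1_000_000, "chr1", 1_000_000),   # different chromosome
]
for pair in pairs:
    print(pair, "->", classify_eqtl(*pair))
```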

## 4. Gene set enrichment analysis (GSEA)

```python
import pandas as pd


def gsea_preranked(ranked_gene_list, gene_sets="MSigDB_Hallmark_2020",
                   n_permutations=1000, min_size=15, max_size=500):
    """
    GSEA (Gene Set Enrichment Analysis), preranked mode.

    Input: gene list ranked by sign(log2FC) × -log10(p)
    Gene set databases:
    - MSigDB Hallmark (H)
    - GO Biological Process (C5:BP)
    - KEGG Pathways (C2:KEGG)
    - Reactome (C2:REACTOME)
    """
    import gseapy as gp

    # Rank score = sign(log2FC) × -log10(pvalue)
    results = gp.prerank(
        rnk=ranked_gene_list,
        gene_sets=gene_sets,
        threads=4,  # recent gseapy releases use `threads`; older ones used `processes`
        permutation_num=n_permutations,
        min_size=min_size,
        max_size=max_size,
        outdir="results/gsea",
        seed=42,
    )

    # res2d columns can be stored as objects, so coerce before comparing
    fdr = pd.to_numeric(results.res2d["FDR q-val"], errors="coerce")
    sig_terms = results.res2d[fdr < 0.05]

    print(f"  GSEA results ({gene_sets}):")
    print(f"    Gene sets tested: {len(results.res2d)}")
    print(f"    Significant (FDR < 0.05): {len(sig_terms)}")
    if len(sig_terms) > 0:
        print("    Top enriched:")
        for _, row in sig_terms.head(5).iterrows():
            nes = float(row["NES"])
            direction = "UP" if nes > 0 else "DOWN"
            print(f"      {row['Term']} (NES={nes:.2f}, {direction})")

    return results


def overrepresentation_analysis(gene_list, background=None,
                                gene_sets="GO_Biological_Process_2021"):
    """
    Gene over-representation analysis (ORA).

    Enrichment analysis based on the Fisher exact test.
    Maps a DEG list onto functional categories.
    """
    import gseapy as gp

    results = gp.enrich(
        gene_list=gene_list,
        gene_sets=gene_sets,
        background=background,
        outdir="results/ora",
    )

    sig = results.res2d[results.res2d["Adjusted P-value"] < 0.05]

    print(f"  ORA results ({gene_sets}):")
    print(f"    Input genes: {len(gene_list)}")
    print(f"    Significant terms: {len(sig)}")

    return results
```
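
`gsea_preranked` expects its input to be ranked already. The sketch below shows one way to derive the sign(log2FC) × -log10(p) score named in the code comment from differential expression rows. It is dependency-free for illustration; `rank_metric`, `build_ranked_list`, the p-value floor, and the toy gene names are all assumptions rather than anything defined by gseapy.

```python
import math


def rank_metric(log2fc, pvalue, floor=1e-300):
    """sign(log2FC) × -log10(p), with p floored to avoid log10(0)."""
    sign = (log2fc > 0) - (log2fc < 0)
    return sign * -math.log10(max(pvalue, floor))


def build_ranked_list(deg_rows):
    """deg_rows: iterable of (gene, log2fc, pvalue); returns (gene, score) sorted descending."""
    scored = [(gene, rank_metric(lfc, p)) for gene, lfc, p in deg_rows
              if p is not None and not math.isnan(p)]  # drop untested genes
    return sorted(scored, key=lambda item: item[1], reverse=True)


rows = [("GENE_A", 2.5, 1e-8), ("GENE_B", -1.8, 1e-5),
        ("GENE_C", 0.3, 0.4), ("GENE_D", 1.0, float("nan"))]
for gene, score in build_ranked_list(rows):
    print(f"{gene}\t{score:.2f}")
```

Strongly up-regulated genes land at the top and strongly down-regulated genes at the bottom, which is the ordering a preranked analysis walks through. The resulting pairs would typically be converted to a gene-indexed pandas Series before being passed as `ranked_gene_list`.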

## References

### Output Files

| File | Format |
|---|---|
| `results/geo_expression_matrix.csv` | CSV |
| `results/deseq2_results.csv` | CSV |
| `results/gsea/` | Directory |
| `results/ora/` | Directory |
| `figures/volcano_rnaseq.png` | PNG |
| `figures/ma_plot.png` | PNG |
| `figures/gsea_dotplot.png` | PNG |

### Available Tools

> External tools available via [ToolUniverse](https://github.com/mims-harvard/ToolUniverse) SMCP.

| Category | Main tool | Purpose |
|---|---|---|
| GEO | `geo_search_datasets` | GEO dataset search |
| GEO | `geo_get_dataset_info` | Dataset details |
| GEO | `geo_get_sample_info` | Sample information |
| GTEx | `GTEx_get_median_gene_expression` | Median expression across tissues |
| GTEx | `GTEx_get_gene_expression` | Sample-level expression data |
| GTEx | `GTEx_get_top_expressed_genes` | Top expressed genes |
| GTEx | `GTEx_get_eqtl_genes` | eQTL genes (eGenes) |
| GTEx | `GTEx_get_single_tissue_eqtls` | Single-tissue eQTLs |
| GTEx | `GTEx_get_multi_tissue_eqtls` | Multi-tissue eQTLs |
| GTEx | `GTEx_calculate_eqtl` | eQTL calculation |
| Expression Atlas | `ExpressionAtlas_search_experiments` | Experiment search |
| Expression Atlas | `ExpressionAtlas_get_baseline` | Baseline expression |
| Expression Atlas | `ExpressionAtlas_search_differential` | Differential expression experiments |
| ArrayExpress | `arrayexpress_search_experiments` | ArrayExpress experiment search |

### Related Skills

| Skill | Relation |
|---|---|
| `scientific-bioinformatics` | Bulk RNA-seq foundations |
| `scientific-single-cell-genomics` | scRNA-seq (single cell) |
| `scientific-epigenomics-chromatin` | Expression-epigenome integration |
| `scientific-multi-omics` | Multi-omics integration |
| `scientific-network-analysis` | Co-expression networks |

### Dependencies

`pydeseq2`, `GEOparse`, `gseapy`, `pandas`, `numpy`, `matplotlib`, `scipy`

@@ -0,0 +1,341 @@
---
name: scientific-immunoinformatics
description: |
  Immunoinformatics skill. Covers epitope prediction (MHC-I/II binding),
  T-cell/B-cell epitope mapping, antibody structure analysis (CDR loops),
  immune repertoire analysis (TCR/BCR clonotypes), vaccine candidate design,
  and an integrated IEDB/IMGT/SAbDab database pipeline.
---

# Scientific Immunoinformatics

Provides analysis pipelines specialized for immunoinformatics. It systematically
covers epitope prediction, MHC binding affinity estimation, antibody sequence and
structure analysis, immune repertoire diversity analysis, and vaccine candidate
prioritization.

## When to Use

- When predicting peptide-MHC binding affinity
- When identifying and mapping T-cell or B-cell epitopes
- When analyzing TCR/BCR repertoire (clonotype) diversity
- When modeling the structure of antibody CDR loops
- When prioritizing vaccine candidate antigens

---

## Quick Start

## 1. MHC-I binding prediction

```python
import numpy as np
import pandas as pd


def predict_mhc_binding(peptides, alleles, method="netmhcpan"):
    """
    Predict MHC class I binding affinity.

    method:
    - "netmhcpan": NetMHCpan 4.1 (peptide-MHC binding IC50 prediction)
    - "mhcflurry": MHCflurry 2.0 (neural-network based)
    Note: the body below always calls MHCflurry; `method` does not yet
    switch backends.

    Thresholds:
    - Strong binder: IC50 < 50 nM (or %Rank < 0.5)
    - Weak binder: IC50 < 500 nM (or %Rank < 2.0)

    Parameters:
        peptides: list of peptide sequences (8-14 mers)
        alleles: list of HLA alleles (e.g., ["HLA-A*02:01", "HLA-B*07:02"])
    """
    from mhcflurry import Class1PresentationPredictor

    predictor = Class1PresentationPredictor.load()

    # Predict each peptide-allele pair separately so every row reports the
    # affinity for that specific allele
    results = []
    for peptide in peptides:
        for allele in alleles:
            pred = predictor.predict(peptides=[peptide], alleles=[allele],
                                     verbose=0)
            results.append({
                "peptide": peptide,
                "allele": allele,
                "affinity_nM": pred["affinity"].values[0],
                "percentile_rank": pred["affinity_percentile"].values[0],
                "processing_score": pred["processing_score"].values[0],
                "presentation_score": pred["presentation_score"].values[0],
            })

    df = pd.DataFrame(results)
    # Classify by the IC50 thresholds listed in the docstring
    df["binding_level"] = np.where(
        df["affinity_nM"] < 50, "Strong",
        np.where(df["affinity_nM"] < 500, "Weak", "Non-binder")
    )

    n_strong = (df["binding_level"] == "Strong").sum()
    n_weak = (df["binding_level"] == "Weak").sum()
    print(f"  MHC-I: {n_strong} strong + {n_weak} weak binders / {len(df)} predictions")
    return df
```
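
`predict_mhc_binding` takes ready-made peptides, so a protein antigen first has to be cut into overlapping windows of the supported lengths. Below is a minimal sketch of that tiling step together with the IC50 thresholds from the docstring; `generate_peptides`, `binding_level`, and the toy sequence are hypothetical and chosen only for illustration.

```python
def generate_peptides(protein_seq, lengths=range(8, 15)):
    """Yield (start, peptide) for every window of each length (1-based start)."""
    for k in lengths:
        for i in range(len(protein_seq) - k + 1):
            yield i + 1, protein_seq[i:i + k]


def binding_level(ic50_nm):
    """Apply the thresholds: < 50 nM strong, < 500 nM weak, else non-binder."""
    if ic50_nm < 50:
        return "Strong"
    if ic50_nm < 500:
        return "Weak"
    return "Non-binder"


seq = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary toy sequence, not a real antigen
nine_mers = [pep for _, pep in generate_peptides(seq, lengths=[9])]
print(f"{len(nine_mers)} 9-mers, first: {nine_mers[0]}")
print(binding_level(32.0), binding_level(210.0), binding_level(4800.0))
```

A sequence of length L yields L - k + 1 peptides per length k, so the number of predictions grows quickly once several lengths and alleles are combined.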

## 2. B-cell epitope prediction

```python
import numpy as np
import pandas as pd


def predict_bcell_epitopes(sequence, window_size=20, threshold=0.5):
    """
    Predict linear B-cell epitopes.

    Intended integrated scoring:
    1. BepiPred 2.0: Random Forest based prediction
    2. Parker hydrophilicity scale
    3. Emini surface accessibility
    4. Chou-Fasman β-turn prediction

    combined_score = 0.4 * bepipred + 0.2 * hydrophilicity +
                     0.2 * surface + 0.2 * beta_turn

    The body below is a simplified proxy that averages windowed
    hydrophilicity and flexibility only.
    """
    from Bio.SeqUtils import ProtParamData
    from Bio.SeqUtils.ProtParam import ProteinAnalysis

    sequence = str(sequence)
    pa = ProteinAnalysis(sequence)

    # Windowed hydrophilicity. protein_scale() needs a dict of per-residue
    # values rather than a scale name, so the Hopp-Woods scale bundled with
    # Biopython (ProtParamData.hw) stands in for the Parker scale here.
    hydrophilicity = pa.protein_scale(ProtParamData.hw, window_size)

    # Flexibility profile (Biopython computes it over a fixed 9-residue window)
    flexibility = pa.flexibility()

    # Simplified B-cell epitope score: mean of the two profiles per window start
    epitopes = []
    for i in range(len(sequence) - window_size + 1):
        window = sequence[i:i + window_size]
        score = np.mean([
            hydrophilicity[i] if i < len(hydrophilicity) else 0,
            flexibility[i] if i < len(flexibility) else 0,
        ])
        if score > threshold:
            epitopes.append({
                "start": i + 1,
                "end": i + window_size,
                "sequence": window,
                "score": score,
            })

    df = pd.DataFrame(epitopes)
    print(f"  B-cell epitopes: {len(df)} predicted (threshold={threshold})")
    return df
```
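
The docstring's weighted formula is not what the function body computes, so it may help to see the combination step in isolation. The sketch below assumes the four per-residue tracks have already been produced by their respective tools and min-max scales each one before weighting; the scaling choice, the function names, and the toy values are assumptions made for illustration.

```python
WEIGHTS = {"bepipred": 0.4, "hydrophilicity": 0.2,
           "surface": 0.2, "beta_turn": 0.2}


def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant track maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_scores(tracks, weights=WEIGHTS):
    """tracks: dict of name -> per-residue scores, all of equal length."""
    scaled = {name: min_max(tracks[name]) for name in weights}
    n = len(next(iter(scaled.values())))
    return [sum(weights[name] * scaled[name][i] for name in weights)
            for i in range(n)]


toy_tracks = {
    "bepipred": [0.1, 0.5, 0.9],
    "hydrophilicity": [-1.0, 0.5, 2.0],
    "surface": [0.2, 0.5, 0.8],
    "beta_turn": [0.0, 0.5, 1.0],
}
print([round(s, 3) for s in combined_scores(toy_tracks)])
```

Scaling first keeps a track with a wide numeric range, such as a hydrophilicity scale, from dominating tracks that are already probabilities.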

## 3. TCR/BCR repertoire analysis

```python
import numpy as np


def repertoire_analysis(clonotype_df, chain="TRB",
                        clone_col="cdr3_aa", count_col="clone_count"):
    """
    TCR/BCR repertoire diversity analysis.

    Diversity metrics:
    - Shannon entropy: H = -Σ pᵢ log₂(pᵢ)
    - Simpson index: D = 1 - Σ pᵢ²
    - Chao1 estimator: S_est = S_obs + f₁²/(2·f₂)
    - Clonality: 1 - H/log₂(N)
    - Gini coefficient: measure of evenness

    Parameters:
        clonotype_df: clonotype DataFrame (cdr3_aa, clone_count)
        chain: TCR/BCR chain (TRA, TRB, IGH, IGL, IGK)
    """
    from scipy.stats import entropy

    counts = clonotype_df[count_col].to_numpy()
    total = counts.sum()
    if total <= 0:
        raise ValueError("clonotype counts must sum to a positive value")
    freqs = counts / total

    # Shannon entropy
    H = entropy(freqs, base=2)
    # Simpson index
    D = 1 - np.sum(freqs ** 2)
    # Clonality
    n_clones = len(counts)
    clonality = 1 - H / np.log2(n_clones) if n_clones > 1 else 0

    # Chao1; use the bias-corrected form when there are no doubletons,
    # where the classic formula would divide by zero
    f1 = int(np.sum(counts == 1))  # singletons
    f2 = int(np.sum(counts == 2))  # doubletons
    if f2 > 0:
        chao1 = n_clones + f1 ** 2 / (2 * f2)
    else:
        chao1 = n_clones + f1 * (f1 - 1) / 2

    # Gini coefficient
    sorted_freqs = np.sort(freqs)
    n = len(sorted_freqs)
    gini = (2 * np.sum(np.arange(1, n + 1) * sorted_freqs)
            / (n * np.sum(sorted_freqs))) - (n + 1) / n

    # Top clones
    top10 = clonotype_df.nlargest(10, count_col)

    metrics = {
        "chain": chain,
        "n_clonotypes": n_clones,
        "total_cells": int(total),
        "shannon_entropy": round(float(H), 4),
        "simpson_index": round(float(D), 4),
        "clonality": round(float(clonality), 4),
        "chao1": round(float(chao1), 1),
        "gini": round(float(gini), 4),
        # The input is not assumed to be sorted, so take the maximum
        "top1_frequency": round(float(freqs.max()), 4),
    }

    print(f"  Repertoire ({chain}): {n_clones} clonotypes, "
          f"Shannon={H:.3f}, Clonality={clonality:.3f}")
    return metrics, top10
```
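
To make the diversity formulas concrete, here is a small worked example using only the standard library. It is a sketch rather than a replacement for the function above: `diversity_metrics` is a hypothetical name, and it uses the bias-corrected Chao1 form, S_obs + f₁(f₁ - 1) / (2(f₂ + 1)), which stays defined when there are no doubletons.

```python
import math


def diversity_metrics(counts):
    """Compute Shannon, Simpson, clonality, and bias-corrected Chao1."""
    total = sum(counts)
    freqs = [c / total for c in counts if c > 0]
    n = len(freqs)
    shannon = -sum(p * math.log2(p) for p in freqs)
    simpson = 1 - sum(p * p for p in freqs)
    clonality = 1 - shannon / math.log2(n) if n > 1 else 0.0
    f1 = sum(1 for c in counts if c == 1)  # singletons
    f2 = sum(1 for c in counts if c == 2)  # doubletons
    chao1 = n + f1 * (f1 - 1) / (2 * (f2 + 1))
    return {"shannon": shannon, "simpson": simpson,
            "clonality": clonality, "chao1": chao1}


# Four clonotypes with frequencies 1/2, 1/4, 1/8, 1/8
print(diversity_metrics([4, 2, 1, 1]))
```

For these counts the entropy is 0.5·1 + 0.25·2 + 2·(0.125·3) = 1.75 bits against a maximum of log₂(4) = 2, giving a clonality of 0.125. A repertoire dominated by one expanded clone would push clonality toward 1.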

## 4. Antibody structure analysis

```python
def extract_region(anarci_result, start, end):
    """
    Return the residues numbered start..end (inclusive) from anarci() output.

    Uses the first domain found in the first sequence and skips gaps ("-").
    """
    numbering = anarci_result[0]
    if not numbering or numbering[0] is None:
        return ""  # no antibody domain was recognized
    residues = numbering[0][0][0]  # list of ((position, insertion_code), aa)
    return "".join(aa for (pos, _ins), aa in residues
                   if start <= pos <= end and aa != "-")


def antibody_structure_analysis(vh_seq, vl_seq, numbering="imgt"):
    """
    Structural analysis of antibody variable regions.

    Pipeline:
    1. ANARCI numbering (IMGT / Kabat / Chothia)
    2. CDR loop identification (CDR-H1/H2/H3, CDR-L1/L2/L3)
    3. Framework region (FR1-FR4) extraction
    4. Generation probability and somatic hypermutation (SHM) rate estimation
    5. Humanization potential score
    Only steps 1 and 2 are implemented below; the SHM estimate is a stub.

    CDR definitions (IMGT scheme, same positions for heavy and light chains):
    CDR1: 27-38
    CDR2: 56-65
    CDR3: 105-117 (variable length)
    """
    from anarci import anarci

    if numbering != "imgt":
        # The CDR boundaries below are only valid for IMGT numbering
        raise ValueError("CDR extraction is implemented for numbering='imgt' only")

    # Numbering
    vh_numbered = anarci([("VH", vh_seq)], scheme=numbering)
    vl_numbered = anarci([("VL", vl_seq)], scheme=numbering)

    # CDR extraction (IMGT positions)
    cdr_regions = {
        "CDR-H1": (27, 38), "CDR-H2": (56, 65), "CDR-H3": (105, 117),
        "CDR-L1": (27, 38), "CDR-L2": (56, 65), "CDR-L3": (105, 117),
    }

    cdrs = {}
    for name, (start, end) in cdr_regions.items():
        chain_data = vh_numbered if "H" in name else vl_numbered
        cdrs[name] = extract_region(chain_data, start, end)

    # SHM rate estimation (difference from germline)
    def estimate_shm_rate(numbered_seq, germline_db="imgt"):
        """Estimate the SHM rate from differences to the germline sequence."""
        # Stub: a real implementation needs a germline database
        return 0.0

    result = {
        "cdrs": cdrs,
        "vh_length": len(vh_seq),
        "vl_length": len(vl_seq),
        "cdr_h3_length": len(cdrs.get("CDR-H3", "")),
        "numbering": numbering,
    }

    print(f"  Antibody: CDR-H3 length={result['cdr_h3_length']}, "
          f"scheme={numbering}")
    return result
```

## 5. Vaccine candidate prioritization

```python
def vaccine_candidate_ranking(antigens_df, weights=None):
    """
    Multi-criteria prioritization of vaccine candidate antigens.

    Criteria:
    1. Antigenicity score: VaxiJen 2.0 score (threshold > 0.4)
    2. Allergenicity: non-allergenic per AllerTOP
    3. Toxicity: non-toxic per ToxinPred
    4. MHC coverage: HLA supertype coverage
    5. Conservation: sequence conservation (across strains)
    6. Surface accessibility: degree of surface exposure

    Composite score = Σ wᵢ · normalized_scoreᵢ
    """
    if weights is None:
        weights = {
            "antigenicity": 0.25,
            "mhc_coverage": 0.25,
            "conservation": 0.20,
            "surface_accessibility": 0.15,
            "non_allergenicity": 0.10,
            "non_toxicity": 0.05,
        }

    # Work on a copy so the caller's DataFrame is left untouched
    antigens_df = antigens_df.copy()

    # Min-max normalization of every criterion present in the input
    for col in weights.keys():
        if col in antigens_df.columns:
            min_val = antigens_df[col].min()
            max_val = antigens_df[col].max()
            if max_val > min_val:
                antigens_df[f"{col}_norm"] = (antigens_df[col] - min_val) / (max_val - min_val)
            else:
                # Constant column: every antigen gets the full score
                antigens_df[f"{col}_norm"] = 1.0

    # Composite score; criteria missing from the input contribute 0
    antigens_df["composite_score"] = sum(
        w * antigens_df.get(f"{col}_norm", 0)
        for col, w in weights.items()
    )

    antigens_df = antigens_df.sort_values("composite_score", ascending=False)
    print(f"  Vaccine candidates: {len(antigens_df)} antigens ranked")
    return antigens_df
```

## References

### Output Files

| File | Format |
|---|---|
| `results/mhc_binding_predictions.csv` | CSV |
| `results/bcell_epitopes.csv` | CSV |
| `results/repertoire_diversity.json` | JSON |
| `results/antibody_structure.json` | JSON |
| `results/vaccine_candidates_ranked.csv` | CSV |
| `figures/epitope_map.png` | PNG |
| `figures/repertoire_clonality.png` | PNG |

### Available Tools

> External tools available via [ToolUniverse](https://github.com/mims-harvard/ToolUniverse) SMCP.

| Category | Main tool | Purpose |
|---|---|---|
| IEDB | `iedb_search_epitopes` | Epitope search |
| IEDB | `iedb_get_epitope_mhc` | Epitope-MHC binding data |
| IEDB | `iedb_search_bcell` | B-cell epitope search |
| IEDB | `iedb_search_mhc` | MHC allele search |
| IEDB | `iedb_search_antigens` | Antigen search |
| IMGT | `IMGT_get_gene_info` | Immune gene information |
| IMGT | `IMGT_get_sequence` | Immunoglobulin sequence retrieval |
| IMGT | `IMGT_search_genes` | Immune gene search |
| SAbDab | `SAbDab_search_structures` | Antibody structure search |
| SAbDab | `SAbDab_get_structure` | Antibody structure retrieval |
| TheraSAbDab | `TheraSAbDab_search_therapeutics` | Therapeutic antibody search |
| TheraSAbDab | `TheraSAbDab_search_by_target` | Therapeutic antibodies by target |
| UniProt | `UniProt_get_entry_by_accession` | Protein information retrieval |

### Related Skills

| Skill | Integration |
|---|---|
| [scientific-sequence-analysis](../scientific-sequence-analysis/SKILL.md) | Sequence alignment and conservation analysis |
| [scientific-protein-structure-analysis](../scientific-protein-structure-analysis/SKILL.md) | Antibody 3D structure analysis |
| [scientific-protein-design](../scientific-protein-design/SKILL.md) | Antibody engineering |
| [scientific-variant-interpretation](../scientific-variant-interpretation/SKILL.md) | HLA typing and variant interpretation |
| [scientific-single-cell-genomics](../scientific-single-cell-genomics/SKILL.md) | Immune cell subtype analysis |

### Dependencies

- mhcflurry, anarci, biopython, immcantation, scirpy