@nahisaho/satori 0.8.0 → 0.10.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +138 -2
- package/package.json +1 -1
- package/src/.github/skills/scientific-admet-pharmacokinetics/SKILL.md +14 -0
- package/src/.github/skills/scientific-bioinformatics/SKILL.md +13 -0
- package/src/.github/skills/scientific-cheminformatics/SKILL.md +13 -0
- package/src/.github/skills/scientific-citation-checker/SKILL.md +12 -0
- package/src/.github/skills/scientific-clinical-decision-support/SKILL.md +14 -0
- package/src/.github/skills/scientific-deep-research/SKILL.md +15 -0
- package/src/.github/skills/scientific-disease-research/SKILL.md +14 -0
- package/src/.github/skills/scientific-drug-repurposing/SKILL.md +14 -0
- package/src/.github/skills/scientific-drug-target-profiling/SKILL.md +14 -0
- package/src/.github/skills/scientific-environmental-ecology/SKILL.md +295 -0
- package/src/.github/skills/scientific-epidemiology-public-health/SKILL.md +332 -0
- package/src/.github/skills/scientific-grant-writing/SKILL.md +12 -0
- package/src/.github/skills/scientific-graph-neural-networks/SKILL.md +12 -0
- package/src/.github/skills/scientific-immunoinformatics/SKILL.md +341 -0
- package/src/.github/skills/scientific-infectious-disease/SKILL.md +342 -0
- package/src/.github/skills/scientific-meta-analysis/SKILL.md +11 -0
- package/src/.github/skills/scientific-metabolomics/SKILL.md +13 -0
- package/src/.github/skills/scientific-microbiome-metagenomics/SKILL.md +349 -0
- package/src/.github/skills/scientific-multi-omics/SKILL.md +13 -0
- package/src/.github/skills/scientific-network-analysis/SKILL.md +13 -0
- package/src/.github/skills/scientific-pharmacovigilance/SKILL.md +15 -0
- package/src/.github/skills/scientific-population-genetics/SKILL.md +336 -0
- package/src/.github/skills/scientific-precision-oncology/SKILL.md +14 -0
- package/src/.github/skills/scientific-protein-design/SKILL.md +13 -0
- package/src/.github/skills/scientific-protein-structure-analysis/SKILL.md +13 -0
- package/src/.github/skills/scientific-sequence-analysis/SKILL.md +13 -0
- package/src/.github/skills/scientific-single-cell-genomics/SKILL.md +361 -0
- package/src/.github/skills/scientific-spatial-transcriptomics/SKILL.md +281 -0
- package/src/.github/skills/scientific-survival-clinical/SKILL.md +12 -0
- package/src/.github/skills/scientific-systems-biology/SKILL.md +310 -0
- package/src/.github/skills/scientific-text-mining-nlp/SKILL.md +358 -0
- package/src/.github/skills/scientific-variant-interpretation/SKILL.md +14 -0
@@ -0,0 +1,310 @@
---
name: scientific-systems-biology
description: |
  Systems biology analysis skill. Dynamic modeling (ODE / SBML),
  metabolic flux analysis (FBA / pFBA), gene regulatory network (GRN) inference,
  signaling pathway modeling, parameter estimation, sensitivity analysis, and
  an integrated BioModels/Reactome/KEGG/BiGG pipeline.
---

# Scientific Systems Biology

Provides a quantitative modeling pipeline for systems biology.
It covers ODE-based dynamic models, flux balance analysis (FBA),
gene regulatory network (GRN) inference, and parameter estimation and sensitivity analysis,
and supports integrated use of BioModels, Reactome, KEGG, and BiGG.

## When to Use

- When a biological pathway needs dynamic (ODE) modeling
- When running flux balance analysis (FBA) on a metabolic network
- When inferring a gene regulatory network (GRN)
- When retrieving and simulating BioModels / SBML models
- When estimating model parameters or running a sensitivity analysis

---

## Quick Start

## 1. SBML Model Simulation

```python
import numpy as np
import pandas as pd

def simulate_sbml_model(sbml_file, duration=100, n_points=1000):
    """
    Simulate an SBML model.

    SBML (Systems Biology Markup Language):
      Standard exchange format for biological models. BioModels DB hosts 1,000+ models.

    Steps:
      1. Load the SBML file into RoadRunner
      2. Set initial conditions
      3. Run the time-course simulation
      4. Extract and visualize the results

    Supported models:
      - ODE-based (deterministic)
      - Stochastic (Gillespie SSA)
      - Hybrid
    """
    import roadrunner

    rr = roadrunner.RoadRunner(sbml_file)
    result = rr.simulate(0, duration, n_points)

    df = pd.DataFrame(result, columns=result.colnames)
    species = [c for c in df.columns if c != "time"]

    print(f" SBML: {len(species)} species simulated over t=[0, {duration}]")
    print(f" Species: {', '.join(species[:5])}{'...' if len(species) > 5 else ''}")
    return df, rr


def steady_state_analysis(rr):
    """
    Steady-state analysis.

    Steady state: dx/dt = f(x, p) = 0
    The eigenvalues of the Jacobian J determine stability:
      - Re(λᵢ) < 0 ∀i: stable equilibrium
      - ∃i: Re(λᵢ) > 0: unstable
    """
    rr.steadyState()
    species_ids = rr.getFloatingSpeciesIds()
    ss_values = rr.getFloatingSpeciesConcentrations()

    # Jacobian at the steady state
    jac = rr.getFullJacobian()
    eigenvalues = np.linalg.eigvals(jac)
    # Convert to a plain bool so the result serializes cleanly
    stable = bool(np.all(np.real(eigenvalues) < 0))

    ss_dict = dict(zip(species_ids, ss_values))
    ss_dict["stable"] = stable
    ss_dict["eigenvalues"] = eigenvalues.tolist()

    print(f" Steady state: {'Stable' if stable else 'Unstable'}")
    return ss_dict
```
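As a small illustration of the stability criterion used in `steady_state_analysis`, the sketch below classifies a steady state from the eigenvalues of a 2×2 Jacobian using only the standard library. The function names and the example matrices are illustrative assumptions, not part of the skill. Unlike the function above, this sketch reports the borderline case (largest real part exactly zero) separately as `marginal`.

```python
import cmath

def eigenvalues_2x2(jac):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = jac
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

def classify_2x2(jac):
    """'stable' if all Re(λ) < 0, 'unstable' if any Re(λ) > 0, else 'marginal'."""
    eigs = eigenvalues_2x2(jac)
    if all(ev.real < 0 for ev in eigs):
        return "stable"
    if any(ev.real > 0 for ev in eigs):
        return "unstable"
    return "marginal"

# Hypothetical Jacobians evaluated at a steady state
decay = [[-1.0, 0.0], [1.0, -2.0]]      # eigenvalues -1 and -2
saddle = [[1.0, 0.0], [0.0, -1.0]]      # eigenvalues 1 and -1
oscillator = [[0.0, 1.0], [-1.0, 0.0]]  # eigenvalues +i and -i

for name, jac in [("decay", decay), ("saddle", saddle), ("oscillator", oscillator)]:
    print(name, classify_2x2(jac))
```

For larger systems the same test applies to the eigenvalues returned by `np.linalg.eigvals`.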

## 2. Flux Balance Analysis (FBA)

```python
def flux_balance_analysis(model_path, objective="biomass", method="fba"):
    """
    Metabolic flux balance analysis.

    FBA formulation:
      max  c^T · v          (objective function, usually biomass)
      s.t. S · v = 0        (steady-state constraint)
           vₘᵢₙ ≤ v ≤ vₘₐₓ   (flux bound constraints)

    method:
      - "fba": standard FBA; solves an LP for the optimal flux distribution
      - "pfba": parsimonious FBA; minimizes total flux at the optimum
      - "fva": flux variability analysis; allowable flux range of each reaction
      - "loopless": loopless FBA

    Input: genome-scale metabolic model (GEM) in SBML / JSON / YAML format
    BiGG Models DB: hosts GEMs for 100+ organisms

    Note: `objective` is not applied in this snippet; the objective stored in
    the model file is used, and only SBML input is read.
    """
    import cobra

    model = cobra.io.read_sbml_model(model_path)
    print(f" Model: {model.id}: {len(model.reactions)} reactions, "
          f"{len(model.metabolites)} metabolites, {len(model.genes)} genes")

    if method == "fba":
        solution = model.optimize()
    elif method == "pfba":
        solution = cobra.flux_analysis.pfba(model)
    elif method == "loopless":
        solution = cobra.flux_analysis.loopless_solution(model)
    elif method == "fva":
        # FVA returns a DataFrame of min/max fluxes rather than a single solution
        fva_result = cobra.flux_analysis.flux_variability_analysis(
            model, fraction_of_optimum=0.9)
        return fva_result
    else:
        raise ValueError(f"Unknown method: {method!r}")

    # Results
    objective_value = solution.objective_value
    fluxes = solution.fluxes

    # Essential genes (single gene knockouts), judged against the wild-type
    # optimum. pFBA reports total flux as its objective value, so the
    # wild-type optimum is recomputed here instead of reusing objective_value.
    wild_type = model.slim_optimize()
    essential = []
    for gene in model.genes:
        with model:  # changes are reverted when the block exits
            gene.knock_out()
            # An infeasible knockout counts as zero growth
            ko_value = model.slim_optimize(error_value=0.0)
            if ko_value < 0.01 * wild_type:
                essential.append(gene.id)

    print(f" FBA: objective={objective_value:.4f}, "
          f"{len(essential)} essential genes")

    result = {
        "objective_value": objective_value,
        "n_active_reactions": int((fluxes.abs() > 1e-6).sum()),
        "n_essential_genes": len(essential),
        "essential_genes": essential,
    }
    return result, fluxes
```
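To make the FBA formulation in the docstring concrete, here is a worked toy example. The one-metabolite network is hypothetical and is not taken from any BiGG model: an uptake reaction $v_1$ produces metabolite A, $v_2$ drains A into biomass, and $v_3$ secretes A as a byproduct.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v = v_2 \\
\text{s.t.}\quad & S v =
  \begin{pmatrix} 1 & -1 & -1 \end{pmatrix}
  \begin{pmatrix} v_1 \\ v_2 \\ v_3 \end{pmatrix} = 0, \\
& 0 \le v_1 \le 10, \qquad 0 \le v_2 \le 1000, \qquad 0 \le v_3 \le 1000 .
\end{aligned}

% The mass balance gives v_2 = v_1 - v_3 \le 10, so the FBA optimum is
v^{*} = (10,\ 10,\ 0), \qquad c^{\top} v^{*} = 10 .

% FVA with fraction_of_optimum = 0.9 adds the constraint v_2 \ge 9, hence
v_1 \in [9,\ 10], \qquad v_2 \in [9,\ 10], \qquad v_3 \in [0,\ 1] .
```

In this toy case the optimum is unique, so pFBA returns the same flux vector, with total flux $v_1 + v_2 + v_3 = 20$.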

## 3. Gene Regulatory Network (GRN) Inference

```python
def infer_grn(expression_matrix, method="genie3", n_top=1000):
    """
    Gene regulatory network (GRN) inference.

    expression_matrix: DataFrame with samples as rows and genes as columns.

    method:
      - "genie3": GENIE3, Random Forest based.
          Regresses each gene gⱼ on all other genes and uses the
          feature importances as regulatory edge weights.
      - "scenic": SCENIC, integrates cis-regulatory analysis
      - "granger": Granger causality, for time-series data

    GENIE3 principle:
      For each target gene gⱼ:
        Train RF: gⱼ = f(g₁, ..., gⱼ₋₁, gⱼ₊₁, ..., gₚ)
        Weight wᵢⱼ = importance of gᵢ for predicting gⱼ
    """
    import networkx as nx
    from arboreto.algo import genie3

    # GENIE3 is the only method implemented in this snippet
    if method == "genie3":
        network = genie3(expression_matrix.values,
                         gene_names=expression_matrix.columns.tolist())
        network = network.sort_values("importance", ascending=False).head(n_top)
    else:
        raise NotImplementedError(f"method={method!r} is not implemented here")

    # Build the directed network (regulator -> target)
    G = nx.DiGraph()
    for _, row in network.iterrows():
        G.add_edge(row["TF"], row["target"], weight=row["importance"])

    # Hub TFs (highest out-degree)
    out_degrees = sorted(G.out_degree(), key=lambda x: x[1], reverse=True)
    top_tfs = out_degrees[:10]

    print(f" GRN: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
    print(f" Top TFs: {', '.join([tf for tf, _ in top_tfs[:5]])}")
    return G, network
```
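The post-processing in `infer_grn` (keep the highest-weighted edges, then rank regulators by out-degree) can be sketched without arboreto or networkx. The edge list and gene names below are made up for illustration; real input would be the `TF` / `target` / `importance` table returned by GENIE3, and the helper names are assumptions.

```python
from collections import Counter

def top_edges(edges, n_top):
    """Keep the n_top edges with the highest importance."""
    return sorted(edges, key=lambda e: e[2], reverse=True)[:n_top]

def rank_hub_tfs(edges, k=10):
    """Rank regulators by out-degree (number of distinct targets)."""
    targets_per_tf = {}
    for tf, target, _importance in edges:
        targets_per_tf.setdefault(tf, set()).add(target)
    out_degree = Counter({tf: len(t) for tf, t in targets_per_tf.items()})
    return out_degree.most_common(k)

# Hypothetical (TF, target, importance) triples
edges = [
    ("TF_A", "g1", 0.9), ("TF_A", "g2", 0.8), ("TF_A", "g3", 0.7),
    ("TF_B", "g1", 0.6), ("TF_B", "g4", 0.5),
    ("TF_C", "g2", 0.1),
]
kept = top_edges(edges, n_top=5)   # drops the weakest edge (TF_C -> g2)
hubs = rank_hub_tfs(kept)
print(hubs)
```

Note that the `n_top` cutoff decides which regulators can appear as hubs at all, so it is worth checking how stable the ranking is under different cutoffs.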

## 4. Parameter Estimation and Sensitivity Analysis

```python
import numpy as np
import pandas as pd
from scipy.optimize import differential_evolution
# Recent SALib releases may emit a deprecation warning for saltelli.sample
# and point to SALib.sample.sobol as its successor.
from SALib.sample import saltelli
from SALib.analyze import sobol

def parameter_estimation(model_func, data, param_bounds, method="de"):
    """
    ODE model parameter estimation.

    data: dict with "y_obs" and, optionally, "sigma".

    method:
      - "de": Differential Evolution (global optimization)
      - "mcmc": MCMC with pymc/emcee (posterior estimation)

    Objective: minimize Σ (y_data - y_model)² / σ²
    """
    def objective(params):
        y_pred = model_func(params)
        residuals = (data["y_obs"] - y_pred) ** 2
        return np.sum(residuals / data.get("sigma", 1) ** 2)

    if method == "de":
        result = differential_evolution(objective, bounds=param_bounds,
                                        seed=42, maxiter=1000, tol=1e-8)
        return {
            "params": result.x,
            "cost": result.fun,
            "success": result.success,
            "message": result.message,
        }

    # Only differential evolution is implemented in this snippet
    raise NotImplementedError(f"method={method!r} is not implemented here")


def global_sensitivity_analysis(model_func, param_names, param_bounds,
                                n_samples=1024):
    """
    Sobol global sensitivity analysis.

    Indices:
      - S1: first-order index (main effect)
      - ST: total-order index (main effect plus all interactions)
      - S2: second-order index (pairwise interactions)

    S1 + interactions = ST
    If ΣS1 < 1, interaction effects are present.
    """
    problem = {
        "num_vars": len(param_names),
        "names": param_names,
        "bounds": param_bounds,
    }

    param_values = saltelli.sample(problem, n_samples)
    Y = np.array([model_func(p) for p in param_values])

    Si = sobol.analyze(problem, Y)

    sa_df = pd.DataFrame({
        "parameter": param_names,
        "S1": Si["S1"],
        "S1_conf": Si["S1_conf"],
        "ST": Si["ST"],
        "ST_conf": Si["ST_conf"],
    })

    print(f" Sensitivity: top parameter = {sa_df.loc[sa_df['ST'].idxmax(), 'parameter']} "
          f"(ST={sa_df['ST'].max():.3f})")
    return sa_df, Si
```
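To show what the first-order index S1 measures, the sketch below estimates it with a plain Monte Carlo pick-freeze estimator using only the standard library. This is not SALib's implementation; the test model `Y = X1 + 2·X2` with independent `Xi ~ U(0, 1)` and the function names are assumptions chosen because the answer is known analytically: Var(Y) = 1/12 + 4/12, so S1 = (0.2, 0.8) and ΣS1 = 1 (no interactions).

```python
import random

def model(x):
    """Hypothetical additive test model: Y = X1 + 2*X2, Xi ~ U(0, 1)."""
    return x[0] + 2.0 * x[1]

def first_order_sobol(func, n_vars, n_samples=20000, seed=0):
    """Estimate first-order Sobol indices with a pick-freeze estimator."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(n_vars)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_vars)] for _ in range(n_samples)]
    fA = [func(a) for a in A]
    fB = [func(b) for b in B]

    all_y = fA + fB
    mean = sum(all_y) / len(all_y)
    var = sum((y - mean) ** 2 for y in all_y) / len(all_y)

    indices = []
    for i in range(n_vars):
        # AB_i: each row of A with column i replaced by the value from B
        fABi = [func(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # V_i ≈ mean( f(B) * (f(AB_i) - f(A)) ); centering f(B) leaves the
        # expectation unchanged and reduces the estimator's variance
        v_i = sum((yb - mean) * (yab - ya)
                  for yb, yab, ya in zip(fB, fABi, fA)) / n_samples
        indices.append(v_i / var)
    return indices

s1 = first_order_sobol(model, n_vars=2)
print([round(s, 2) for s in s1])  # analytical values are 0.2 and 0.8
```

For a model with interactions (for example `Y = X1 * X2`), the estimated first-order indices would sum to less than 1, which is the situation the docstring describes.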

## References

### Output Files

| File | Format |
|---|---|
| `results/simulation_timecourse.csv` | CSV |
| `results/fba_fluxes.csv` | CSV |
| `results/grn_network.csv` | CSV |
| `results/sensitivity_analysis.csv` | CSV |
| `results/parameter_estimates.json` | JSON |
| `figures/timecourse_plot.png` | PNG |
| `figures/flux_map.png` | PNG |
| `figures/grn_graph.png` | PNG |

### Available Tools

> External tools available via [ToolUniverse](https://github.com/mims-harvard/ToolUniverse) SMCP.

| Category | Key Tool | Purpose |
|---|---|---|
| BioModels | `biomodels_search` | Search SBML models |
| BioModels | `BioModels_get_model` | Retrieve model details |
| BioModels | `BioModels_download_model` | Download a model |
| Reactome | `Reactome_get_pathway` | Retrieve pathway information |
| Reactome | `Reactome_get_pathway_reactions` | List pathway reactions |
| Reactome | `Reactome_map_uniprot_to_pathways` | Map UniProt IDs to pathways |
| BiGG | `BiGG_search` | Search metabolic models |
| BiGG | `BiGG_get_model` | Retrieve a GEM |
| BiGG | `BiGG_get_reaction` | Retrieve reaction details |
| KEGG | `kegg_get_pathway_info` | KEGG pathway information |
| KEGG | `kegg_get_gene_info` | KEGG gene information |

### Related Skills

| Skill | Integration |
|---|---|
| [scientific-network-analysis](../scientific-network-analysis/SKILL.md) | GRN network analysis |
| [scientific-multi-omics](../scientific-multi-omics/SKILL.md) | Multi-omics data integration |
| [scientific-bayesian-statistics](../scientific-bayesian-statistics/SKILL.md) | Bayesian parameter estimation |
| [scientific-doe](../scientific-doe/SKILL.md) | Experimental design and sensitivity analysis |
| [scientific-metabolomics](../scientific-metabolomics/SKILL.md) | Metabolic flux and metabolome integration |

#### Dependencies

- cobra (cobrapy), roadrunner (libroadrunner), arboreto, SALib, scipy, networkx
@@ -0,0 +1,358 @@
---
name: scientific-text-mining-nlp
description: |
  Scientific text mining and NLP skill. Pipeline for biomedical NER (genes / diseases / drugs / compounds),
  relation extraction (PPI / DDI / GDA), literature-based knowledge graph construction,
  evidence summarization, topic modeling, and citation network analysis.
  Integrates PubTator / SemanticScholar / EuropePMC data.
---

# Scientific Text Mining & NLP

Provides a natural language processing (NLP) pipeline for scientific literature.
It systematically covers biomedical entity recognition, relation extraction, knowledge graph construction,
topic modeling, and automated evidence summarization.

## When to Use

- When automatically extracting gene, disease, and drug names from a large body of scientific literature
- When extracting relations such as protein-protein interactions (PPI) from the literature
- When building a literature-based knowledge graph
- When running topic modeling on research trends
- When identifying influential papers through citation network analysis

---

## Quick Start

## 1. Biomedical NER (Named Entity Recognition)

```python
import numpy as np
import pandas as pd

def biomedical_ner(texts, model="biobert", entity_types=None):
    """
    Entity recognition from biomedical text.

    model:
      - "biobert": BioBERT, a BERT model pre-trained on PubMed
      - "scispacy": SciSpaCy, spaCy specialized for scientific text
      - "pubtator": PubTator3 API, NCBI's NER service

    entity_types:
      - Gene/Protein: gene and protein names
      - Disease: disease names (MeSH / OMIM ID)
      - Chemical/Drug: compound and drug names (MeSH / DrugBank ID)
      - Species: organism species
      - Mutation: mutations (tmVar format)
      - Cell Line / Cell Type
    """
    if entity_types is None:
        entity_types = ["Gene", "Disease", "Chemical", "Species", "Mutation"]
    # Note: entity_types is not yet used to filter the results below.

    if model == "scispacy":
        import spacy
        # Importing these modules registers the spaCy pipeline factories
        from scispacy.abbreviation import AbbreviationDetector  # noqa: F401
        from scispacy.linking import EntityLinker  # noqa: F401

        nlp = spacy.load("en_core_sci_lg")
        # resolve_abbreviations only takes effect if the detector runs first
        nlp.add_pipe("abbreviation_detector")
        nlp.add_pipe("scispacy_linker", config={
            "resolve_abbreviations": True,
            "linker_name": "umls"
        })

        all_entities = []
        for i, text in enumerate(texts):
            doc = nlp(text)
            for ent in doc.ents:
                all_entities.append({
                    "doc_id": i,
                    "text": ent.text,
                    "label": ent.label_,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    # Best-scoring knowledge base candidate, if any
                    "kb_id": ent._.kb_ents[0][0] if ent._.kb_ents else None,
                    "confidence": ent._.kb_ents[0][1] if ent._.kb_ents else None,
                })

        df = pd.DataFrame(all_entities)
        print(f" NER: {len(df)} entities from {len(texts)} documents")
        return df

    elif model == "biobert":
        from transformers import pipeline
        # Model ID as given in the source; confirm it is available on the
        # Hugging Face Hub before use.
        ner_pipeline = pipeline("ner", model="dmis-lab/biobert-large-cased-v1.1-ner",
                                aggregation_strategy="simple")

        all_entities = []
        for i, text in enumerate(texts):
            entities = ner_pipeline(text)
            for ent in entities:
                all_entities.append({
                    "doc_id": i, "text": ent["word"],
                    "label": ent["entity_group"],
                    "score": ent["score"],
                })

        return pd.DataFrame(all_entities)

    # "pubtator" is listed above but not implemented in this snippet
    raise NotImplementedError(f"model={model!r} is not implemented here")
```

## 2. Relation Extraction

```python
def relation_extraction(texts, relation_type="ppi", model="biobert_re"):
    """
    Relation extraction from scientific literature.

    relation_type:
      - "ppi": Protein-Protein Interaction
      - "ddi": Drug-Drug Interaction
      - "gda": Gene-Disease Association
      - "chem_disease": Chemical-Disease Relation
      - "chem_gene": Chemical-Gene Interaction

    Pipeline:
      1. Extract entities with NER
      2. Enumerate entity pairs within the same sentence as candidates
      3. Classify the relation of each pair (BERT-based)
      4. Filter by confidence
    """
    from transformers import pipeline

    # Placeholder checkpoints: dmis-lab/biobert-v1.1 is the base BioBERT model,
    # not a relation classifier. Swap in a checkpoint fine-tuned for the chosen
    # relation type before relying on the predicted labels.
    if relation_type == "ppi":
        re_model = "dmis-lab/biobert-v1.1"
    elif relation_type == "ddi":
        re_model = "dmis-lab/biobert-v1.1"
    else:
        raise NotImplementedError(
            f"relation_type={relation_type!r} is not implemented here")

    classifier = pipeline("text-classification", model=re_model)

    # Run NER once for all documents instead of reloading the model per text
    ner_results = biomedical_ner(texts, model="scispacy")

    relations = []
    for i, text in enumerate(texts):
        if ner_results.empty:
            break
        entities = ner_results[ner_results["doc_id"] == i]

        # Classify every unordered entity pair in the document
        for idx_a, ent_a in entities.iterrows():
            for idx_b, ent_b in entities.iterrows():
                if idx_a < idx_b:
                    # Text with both entities marked in context
                    marked_text = mark_entities(text, ent_a, ent_b)
                    # Truncate by tokens (not characters) to the model limit
                    pred = classifier(marked_text, truncation=True, max_length=512)

                    if pred[0]["score"] > 0.7:
                        relations.append({
                            "doc_id": i,
                            "entity_a": ent_a["text"],
                            "entity_b": ent_b["text"],
                            "relation": pred[0]["label"],
                            "confidence": pred[0]["score"],
                        })

    df = pd.DataFrame(relations)
    print(f" RE: {len(df)} relations from {len(texts)} documents")
    return df


def mark_entities(text, ent_a, ent_b):
    """Return the text with the two entities wrapped in markers."""
    # Simplified implementation: mark with @ENTITY_A@ / @ENTITY_B@.
    # str.replace marks every occurrence of the entity string, not only the
    # mention at the detected offsets.
    marked = text.replace(ent_a["text"], f"@ENTITY_A@ {ent_a['text']} @/ENTITY_A@")
    marked = marked.replace(ent_b["text"], f"@ENTITY_B@ {ent_b['text']} @/ENTITY_B@")
    return marked
```
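The `mark_entities` helper above relies on `str.replace`, which marks every occurrence of an entity string. Since the NER output also carries `start` / `end` character offsets, a variant that marks only the detected mentions is easy to sketch. The function name `mark_by_offsets`, the example sentence, and the offsets below are illustrative assumptions.

```python
from itertools import combinations

def mark_by_offsets(text, ent_a, ent_b):
    """Wrap two non-overlapping entity spans using their character offsets."""
    spans = sorted(
        [(ent_a["start"], ent_a["end"], "ENTITY_A"),
         (ent_b["start"], ent_b["end"], "ENTITY_B")],
        reverse=True,  # insert from the right so earlier offsets stay valid
    )
    marked = text
    for start, end, tag in spans:
        marked = f"{marked[:start]}@{tag}@ {marked[start:end]} @/{tag}@{marked[end:]}"
    return marked

text = "BRCA1 interacts with BARD1, and BRCA1 loss impairs repair."
entities = [
    {"text": "BRCA1", "start": 0, "end": 5},
    {"text": "BARD1", "start": 21, "end": 26},
    {"text": "BRCA1", "start": 32, "end": 37},
]

# Candidate pairs: every unordered pair of mentions
pairs = list(combinations(entities, 2))
marked = mark_by_offsets(text, *pairs[0])
print(marked)
```

Only the first `BRCA1` mention is wrapped here, whereas `str.replace` would wrap both, which can confuse a classifier that expects exactly one marked span per entity.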

## 3. Knowledge Graph Construction

```python
def build_knowledge_graph(entities_df, relations_df, min_confidence=0.7):
    """
    Build a literature-based knowledge graph.

    Nodes: entities (genes, diseases, drugs, pathways, etc.)
    Edges: relations (interacts_with, treats, causes, associated_with, etc.)

    Pipeline:
      1. Entity normalization (unify to UMLS CUI / MeSH)
      2. Merge duplicate entities
      3. Aggregate relations (frequency + maximum confidence)
      4. Build the graph + community detection
    """
    import networkx as nx
    from networkx.algorithms.community import louvain_communities

    # Confidence filter
    rel_filtered = relations_df[relations_df["confidence"] >= min_confidence]

    # Build the graph
    G = nx.MultiDiGraph()

    # Add entity nodes (repeated mentions collapse onto the same node)
    for _, ent in entities_df.iterrows():
        G.add_node(ent["text"], type=ent["label"],
                   kb_id=ent.get("kb_id", None))

    # Aggregate relations: one edge per (entity_a, entity_b, relation) with
    # its mention count and highest confidence, as described in step 3
    aggregated = (
        rel_filtered
        .groupby(["entity_a", "entity_b", "relation"])["confidence"]
        .agg(frequency="count", confidence="max")
        .reset_index()
    )
    for _, rel in aggregated.iterrows():
        G.add_edge(rel["entity_a"], rel["entity_b"],
                   key=rel["relation"],
                   relation=rel["relation"],
                   confidence=rel["confidence"],
                   frequency=rel["frequency"])

    # Community detection on the undirected simple graph
    G_simple = nx.Graph(G)
    communities = louvain_communities(G_simple, resolution=1.0)

    print(f" KG: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges, "
          f"{len(communities)} communities")
    return G, communities
```
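Step 3 of the pipeline in the docstring (relation aggregation by frequency and maximum confidence) can be sketched in plain Python. The function name `aggregate_relations` and the example records are hypothetical; the record keys mirror the columns produced by `relation_extraction`.

```python
def aggregate_relations(relations, min_confidence=0.7):
    """Collapse repeated (entity_a, entity_b, relation) triples into one record
    holding the mention count and the highest confidence seen."""
    aggregated = {}
    for rel in relations:
        if rel["confidence"] < min_confidence:
            continue  # drop low-confidence mentions before counting
        key = (rel["entity_a"], rel["entity_b"], rel["relation"])
        entry = aggregated.setdefault(key, {"frequency": 0, "confidence": 0.0})
        entry["frequency"] += 1
        entry["confidence"] = max(entry["confidence"], rel["confidence"])
    return aggregated

# Hypothetical extracted relations
relations = [
    {"entity_a": "TP53", "entity_b": "MDM2", "relation": "interacts_with", "confidence": 0.91},
    {"entity_a": "TP53", "entity_b": "MDM2", "relation": "interacts_with", "confidence": 0.78},
    {"entity_a": "TP53", "entity_b": "MDM2", "relation": "interacts_with", "confidence": 0.40},
    {"entity_a": "imatinib", "entity_b": "CML", "relation": "treats", "confidence": 0.88},
]
edges = aggregate_relations(relations)
for key, attrs in edges.items():
    print(key, attrs)
```

Each aggregated record maps directly onto one graph edge, with `frequency` usable as an edge weight for community detection.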

## 4. Topic Modeling

```python
def topic_modeling(abstracts, n_topics=10, method="bertopic"):
    """
    Topic modeling of scientific literature.

    method:
      - "bertopic": BERTopic; BERT embeddings + HDBSCAN + c-TF-IDF
      - "lda": LDA (Latent Dirichlet Allocation); probabilistic topic model
      - "nmf": NMF (Non-negative Matrix Factorization)

    BERTopic pipeline:
      1. Embed documents with BERT / SPECTER
      2. Reduce dimensionality with UMAP
      3. Cluster with HDBSCAN
      4. Extract topic words with c-TF-IDF
    """
    if method == "bertopic":
        from bertopic import BERTopic
        from sentence_transformers import SentenceTransformer

        embedding_model = SentenceTransformer("allenai-specter")
        topic_model = BERTopic(embedding_model=embedding_model,
                               nr_topics=n_topics,
                               calculate_probabilities=True)

        topics, probs = topic_model.fit_transform(abstracts)

        topic_info = topic_model.get_topic_info()
        # Topic -1 is BERTopic's outlier bucket and is not counted as a topic
        n_found = int((topic_info["Topic"] != -1).sum())
        print(f" Topics: {n_found} topics from {len(abstracts)} documents")
        return topic_model, topics, probs

    elif method == "lda":
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.feature_extraction.text import CountVectorizer

        vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
        dtm = vectorizer.fit_transform(abstracts)

        lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
        lda.fit(dtm)

        feature_names = vectorizer.get_feature_names_out()
        topics = {}
        for i, topic_dist in enumerate(lda.components_):
            # Ten highest-weighted words, most important first
            top_words = [feature_names[j] for j in topic_dist.argsort()[-10:][::-1]]
            topics[f"Topic_{i}"] = top_words

        return lda, topics

    # "nmf" is listed above but not implemented in this snippet
    raise NotImplementedError(f"method={method!r} is not implemented here")
```

## 5. Citation Network Analysis

```python
import pandas as pd

def citation_network_analysis(papers_df, citations_df):
    """
    Citation network analysis.

    Metrics:
      - In-degree: number of citations received → influence
      - PageRank: influence weighted by the quality of the citing papers
      - Hub/Authority (HITS): Hub = cites many papers, Authority = cited by many
      - Citation burst: rapid rise in citations (emerging topic)
      - Bibliographic coupling: pairs of papers that cite the same papers
      - Co-citation: pairs of papers that are cited together
    """
    import networkx as nx

    G = nx.DiGraph()
    for _, paper in papers_df.iterrows():
        G.add_node(paper["paper_id"], title=paper["title"],
                   year=paper["year"])

    # Edge direction: citing paper -> cited paper
    for _, cite in citations_df.iterrows():
        G.add_edge(cite["citing"], cite["cited"])

    # PageRank
    pagerank = nx.pagerank(G, alpha=0.85)

    # HITS
    hubs, authorities = nx.hits(G, max_iter=100)

    # Collect the metrics per paper
    metrics_df = pd.DataFrame({
        "paper_id": list(G.nodes()),
        "in_degree": [G.in_degree(n) for n in G.nodes()],
        "out_degree": [G.out_degree(n) for n in G.nodes()],
        "pagerank": [pagerank.get(n, 0) for n in G.nodes()],
        "hub_score": [hubs.get(n, 0) for n in G.nodes()],
        "authority_score": [authorities.get(n, 0) for n in G.nodes()],
    })
    metrics_df = metrics_df.sort_values("pagerank", ascending=False)

    print(f" Citation network: {G.number_of_nodes()} papers, "
          f"{G.number_of_edges()} citations")
    return G, metrics_df
```
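To show what the PageRank score in `citation_network_analysis` computes, here is a minimal power-iteration sketch in plain Python. It is not networkx's implementation; the function name, the tolerance, and the four-paper citation list are assumptions for illustration, and duplicate citation pairs are assumed to be absent.

```python
def pagerank(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed graph of (citing, cited) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by papers that cite nothing is spread uniformly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n for node in nodes}
        # Each paper passes its rank on, split evenly across its references
        for node in nodes:
            for cited in out_links[node]:
                new_rank[cited] += alpha * rank[node] / len(out_links[node])
        converged = sum(abs(new_rank[x] - rank[x]) for x in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# Hypothetical citations: P1 is cited by everyone, P2 only by P3
citations = [("P2", "P1"), ("P3", "P1"), ("P3", "P2"), ("P4", "P1")]
scores = pagerank(citations)
for paper, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(paper, round(score, 3))
```

In this toy graph P1 ranks highest, and P2 outranks the uncited papers P3 and P4 because it receives part of P3's rank.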

## References

### Output Files

| File | Format |
|---|---|
| `results/ner_entities.csv` | CSV |
| `results/relations.csv` | CSV |
| `results/knowledge_graph.json` | JSON |
| `results/topic_model_info.csv` | CSV |
| `results/citation_metrics.csv` | CSV |
| `figures/kg_visualization.png` | PNG |
| `figures/topic_distribution.png` | PNG |
| `figures/citation_network.png` | PNG |

### Available Tools

> External tools available via [ToolUniverse](https://github.com/mims-harvard/ToolUniverse) SMCP.

| Category | Key Tool | Purpose |
|---|---|---|
| PubTator | `PubTator3_LiteratureSearch` | Literature search with NER annotations |
| PubTator | `PubTator3_EntityAutocomplete` | Entity autocompletion |
| PubMed | `PubMed_search_articles` | PubMed literature search |
| PubMed | `PubMed_get_cited_by` | Retrieve citing papers |
| PubMed | `PubMed_get_related` | Retrieve related papers |
| SemanticScholar | `SemanticScholar_search_papers` | Scholarly paper search |
| EuropePMC | `EuropePMC_search_articles` | EuropePMC search |
| EuropePMC | `EuropePMC_get_fulltext` | Retrieve full text |
| EuropePMC | `EuropePMC_get_references` | Retrieve references |
| OpenAlex | `openalex_search_works` | OpenAlex search |
| DBLP | `DBLP_search_publications` | CS literature search |

### Related Skills

| Skill | Integration |
|---|---|
| [scientific-deep-research](../scientific-deep-research/SKILL.md) | In-depth literature survey |
| [scientific-citation-checker](../scientific-citation-checker/SKILL.md) | Citation verification |
| [scientific-network-analysis](../scientific-network-analysis/SKILL.md) | Network analysis |
| [scientific-meta-analysis](../scientific-meta-analysis/SKILL.md) | Systematic literature review |
| [scientific-graph-neural-networks](../scientific-graph-neural-networks/SKILL.md) | Knowledge graph reasoning |

#### Dependencies

- scispacy, spacy, transformers, bertopic, sentence-transformers, networkx
@@ -301,6 +301,20 @@ def pgx_recommendation(gene, phenotype, drug):
 | `results/variant_classification.json` | ACMG/AMP classification data (JSON) | When classification completes |
 | `results/pgx_report.json` | Pharmacogenomics report (JSON) | When PGx evaluation completes |
 
+### Available Tools
+
+> External tools available via [ToolUniverse](https://github.com/mims-harvard/ToolUniverse) SMCP.
+
+| Category | Key Tool | Purpose |
+|---|---|---|
+| ClinVar | `clinvar_search_variants` | Search variant pathogenicity classifications |
+| gnomAD | `gnomad_get_gene_constraints` | Gene constraint metrics (pLI / LOEUF) |
+| ClinGen | `ClinGen_get_gene_validity` | Gene-disease validity assessment |
+| AlphaMissense | `AlphaMissense_get_variant_score` | Missense pathogenicity prediction score |
+| PharmGKB | `PharmGKB_search_variants` | Pharmacogenomic variant search |
+| CADD | `CADD_get_variant_score` | Variant deleteriousness score |
+| MyVariant | `MyVariant_get_variant_annotation` | Integrated variant annotation |
+
 ### Related Skills
 
 | Skill | Integration |