@nahisaho/satori 0.1.0
- package/LICENCE +0 -0
- package/README.md +191 -0
- package/bin/satori.js +95 -0
- package/package.json +29 -0
- package/src/.github/skills/scientific-academic-writing/SKILL.md +361 -0
- package/src/.github/skills/scientific-academic-writing/assets/acs_article.md +199 -0
- package/src/.github/skills/scientific-academic-writing/assets/elsevier_article.md +244 -0
- package/src/.github/skills/scientific-academic-writing/assets/ieee_transactions.md +212 -0
- package/src/.github/skills/scientific-academic-writing/assets/imrad_standard.md +181 -0
- package/src/.github/skills/scientific-academic-writing/assets/nature_article.md +179 -0
- package/src/.github/skills/scientific-academic-writing/assets/qiita_technical_article.md +385 -0
- package/src/.github/skills/scientific-academic-writing/assets/science_research_article.md +169 -0
- package/src/.github/skills/scientific-bioinformatics/SKILL.md +220 -0
- package/src/.github/skills/scientific-biosignal-processing/SKILL.md +357 -0
- package/src/.github/skills/scientific-causal-inference/SKILL.md +347 -0
- package/src/.github/skills/scientific-cheminformatics/SKILL.md +196 -0
- package/src/.github/skills/scientific-data-preprocessing/SKILL.md +413 -0
- package/src/.github/skills/scientific-data-simulation/SKILL.md +244 -0
- package/src/.github/skills/scientific-doe/SKILL.md +360 -0
- package/src/.github/skills/scientific-eda-correlation/SKILL.md +141 -0
- package/src/.github/skills/scientific-feature-importance/SKILL.md +208 -0
- package/src/.github/skills/scientific-image-analysis/SKILL.md +310 -0
- package/src/.github/skills/scientific-materials-characterization/SKILL.md +368 -0
- package/src/.github/skills/scientific-meta-analysis/SKILL.md +352 -0
- package/src/.github/skills/scientific-metabolomics/SKILL.md +326 -0
- package/src/.github/skills/scientific-ml-classification/SKILL.md +265 -0
- package/src/.github/skills/scientific-ml-regression/SKILL.md +215 -0
- package/src/.github/skills/scientific-multi-omics/SKILL.md +303 -0
- package/src/.github/skills/scientific-network-analysis/SKILL.md +257 -0
- package/src/.github/skills/scientific-pca-tsne/SKILL.md +235 -0
- package/src/.github/skills/scientific-pipeline-scaffold/SKILL.md +331 -0
- package/src/.github/skills/scientific-process-optimization/SKILL.md +215 -0
- package/src/.github/skills/scientific-publication-figures/SKILL.md +208 -0
- package/src/.github/skills/scientific-sequence-analysis/SKILL.md +389 -0
- package/src/.github/skills/scientific-spectral-signal/SKILL.md +227 -0
- package/src/.github/skills/scientific-statistical-testing/SKILL.md +240 -0
- package/src/.github/skills/scientific-survival-clinical/SKILL.md +239 -0
- package/src/.github/skills/scientific-time-series/SKILL.md +291 -0

@@ -0,0 +1,413 @@
---
name: scientific-data-preprocessing
description: |
  Preprocessing pipeline skill for scientific data. Provides templates for
  missing-value imputation (KNNImputer/SimpleImputer), encoding
  (LabelEncoder/one-hot/dummy variables), scaling (Standard/MinMax/Robust/Pareto),
  log transforms, and outlier handling. A foundational skill applied across all
  of Exp-01 through Exp-13.
---

# Scientific Data Preprocessing

A standard preprocessing pipeline for scientific data analysis: a collection of
templates for implementing data cleaning, transformation, and normalization in a
traceable and reproducible way. Applied as a common stage across all of
Exp-01 through Exp-13.

## When to Use

- Cleaning and transforming CSV / DataFrame data before analysis
- Choosing a missing-value imputation strategy
- Encoding categorical variables
- Scaling or normalizing numeric data
- Detecting and removing outliers

## Quick Start

## 1. Pipeline Overview

```
Raw Data
 ├─ Step 1: Data quality check
 ├─ Step 2: Missing-value handling
 ├─ Step 3: Outlier handling
 ├─ Step 4: Encoding
 ├─ Step 5: Scaling / transformation
 └─ Step 6: Quality report
```
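For supervised workflows, the same chain of steps can also be composed as a single scikit-learn `Pipeline`/`ColumnTransformer`, which keeps all fitted state in one object. A minimal sketch; the toy frame and its column names are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; columns are illustrative only
df = pd.DataFrame({
    "temp": [25.0, 300.0, np.nan, 450.0],
    "material": ["ZnO", "TiO2", "ZnO", np.nan],
})

# Impute then scale numerics; impute then one-hot encode categoricals
pre = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["temp"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["material"]),
])

X = pre.fit_transform(df)  # 1 scaled numeric column + 2 one-hot columns
```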

## 2. Data Quality Check

```python
import pandas as pd
import numpy as np

def data_quality_report(df, name="dataset"):
    """
    Print a data quality report.
    Run this first; it informs every later preprocessing decision.
    """
    report = pd.DataFrame({
        "dtype": df.dtypes,
        "nunique": df.nunique(),
        "missing_count": df.isnull().sum(),
        "missing_pct": (df.isnull().sum() / len(df) * 100).round(2),
        "zeros_count": (df == 0).sum(),
    })

    # Statistics for numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    for col in numeric_cols:
        report.loc[col, "mean"] = df[col].mean()
        report.loc[col, "std"] = df[col].std()
        report.loc[col, "skewness"] = df[col].skew()
        report.loc[col, "kurtosis"] = df[col].kurtosis()

    print(f"=== Data Quality Report: {name} ===")
    print(f"Shape: {df.shape}")
    print(f"Duplicated rows: {df.duplicated().sum()}")
    print(f"Columns with >50% missing: "
          f"{report[report['missing_pct'] > 50].index.tolist()}")
    print(report.to_string())

    return report
```
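The same metrics the report tabulates can be checked directly on a small frame; the toy data here is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": ["x", "y", "y", None],
})

# Core quality metrics, computed directly
report = pd.DataFrame({
    "dtype": df.dtypes,
    "nunique": df.nunique(),
    "missing_pct": (df.isnull().sum() / len(df) * 100).round(2),
})
n_dupes = df.duplicated().sum()  # no two rows are identical here
```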

## 3. Missing-Value Handling — Selection Guide

| Method | Use case | Caveats |
|---|---|---|
| `dropna()` | Complete-case analysis, <5% missing | Reduces sample size |
| `fillna(median)` | Numeric data with outliers | Can distort the distribution |
| `fillna(mean)` | Numeric data, roughly normal | Sensitive to outliers |
| `SimpleImputer(strategy='most_frequent')` | Categorical variables | |
| `KNNImputer(n_neighbors=5)` | Correlated variables | Computationally expensive |
| `IterativeImputer` | Multivariate missingness patterns (MICE) | scikit-learn experimental |

```python
from sklearn.impute import SimpleImputer, KNNImputer

def handle_missing_values(df, strategy="auto", numeric_cols=None, categorical_cols=None):
    """
    Missing-value handling pipeline.

    strategy:
      "auto"   — choose per column based on missing rate
      "drop"   — drop rows with missing values
      "median" — impute with the median
      "knn"    — impute with KNN (n_neighbors=5)
    """
    df = df.copy()

    if numeric_cols is None:
        numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if categorical_cols is None:
        categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

    if strategy == "auto":
        missing_pct = df[numeric_cols].isnull().mean()
        # Pick the best strategy per column
        low_missing = missing_pct[missing_pct < 0.05].index.tolist()
        mid_missing = missing_pct[(missing_pct >= 0.05) & (missing_pct < 0.30)].index.tolist()
        high_missing = missing_pct[missing_pct >= 0.30].index.tolist()

        # Low: median imputation
        if low_missing:
            imp = SimpleImputer(strategy="median")
            df[low_missing] = imp.fit_transform(df[low_missing])

        # Mid: KNN imputation
        if mid_missing:
            imp = KNNImputer(n_neighbors=5)
            df[mid_missing] = imp.fit_transform(df[mid_missing])

        # High: warn only (dropping is usually preferable)
        if high_missing:
            print(f"WARNING: columns with >30% missing: {high_missing}")
            print("Consider dropping these columns or using domain-specific imputation.")

    elif strategy == "drop":
        df = df.dropna(subset=numeric_cols)

    elif strategy == "median":
        imp = SimpleImputer(strategy="median")
        df[numeric_cols] = imp.fit_transform(df[numeric_cols])

    elif strategy == "knn":
        imp = KNNImputer(n_neighbors=5)
        df[numeric_cols] = imp.fit_transform(df[numeric_cols])

    # Categorical variables → impute with the mode
    if categorical_cols:
        imp_cat = SimpleImputer(strategy="most_frequent")
        df[categorical_cols] = imp_cat.fit_transform(df[categorical_cols])

    return df
```
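The guide table mentions `IterativeImputer` (MICE) but the code above does not exercise it; since it is still experimental in scikit-learn it needs an explicit enable import. A minimal sketch on invented correlated data:

```python
import numpy as np
import pandas as pd
# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
X["x3"] = 2 * X["x1"] + rng.normal(0, 0.1, 100)  # correlated column
X.loc[::10, "x3"] = np.nan                        # plant 10 missing values

imp = IterativeImputer(random_state=0, max_iter=10)
X_imp = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
```

Because `x3` is strongly predictable from `x1`, the iterative model recovers the missing entries far better than a column-wise median would.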

## 4. Outlier Handling

```python
def detect_outliers(df, columns, method="iqr", threshold=None):
    """
    Detect outliers.

    method:
      "iqr"    — IQR rule (Q1 - t*IQR, Q3 + t*IQR); default t=1.5
      "zscore" — z-score rule (|z| > t); default t=3
      "mad"    — MAD (Median Absolute Deviation) rule; default t=3.5
    """
    defaults = {"iqr": 1.5, "zscore": 3.0, "mad": 3.5}
    if threshold is None:
        threshold = defaults[method]

    outlier_mask = pd.DataFrame(False, index=df.index, columns=columns)

    for col in columns:
        vals = df[col].dropna()
        if method == "iqr":
            Q1, Q3 = vals.quantile(0.25), vals.quantile(0.75)
            IQR = Q3 - Q1
            lower, upper = Q1 - threshold * IQR, Q3 + threshold * IQR
            outlier_mask[col] = (df[col] < lower) | (df[col] > upper)

        elif method == "zscore":
            z = np.abs((df[col] - vals.mean()) / vals.std())
            outlier_mask[col] = z > threshold

        elif method == "mad":
            median = vals.median()
            mad = np.median(np.abs(vals - median))
            modified_z = 0.6745 * (df[col] - median) / (mad + 1e-10)
            outlier_mask[col] = np.abs(modified_z) > threshold

    summary = pd.DataFrame({
        "outlier_count": outlier_mask.sum(),
        "outlier_pct": (outlier_mask.sum() / len(df) * 100).round(2),
    })

    return outlier_mask, summary


def handle_outliers(df, columns, method="iqr", action="clip"):
    """
    Handle outliers.

    action:
      "clip"   — clip to the IQR bounds
      "remove" — drop rows containing outliers
      "nan"    — replace with NaN (impute downstream)
    """
    outlier_mask, summary = detect_outliers(df, columns, method=method)
    df = df.copy()

    if action == "remove":
        any_outlier = outlier_mask.any(axis=1)
        df = df[~any_outlier]

    elif action == "clip":
        # Note: clipping always uses IQR bounds, regardless of `method`
        for col in columns:
            vals = df[col].dropna()
            Q1, Q3 = vals.quantile(0.25), vals.quantile(0.75)
            IQR = Q3 - Q1
            lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
            df[col] = df[col].clip(lower, upper)

    elif action == "nan":
        for col in columns:
            df.loc[outlier_mask[col], col] = np.nan

    return df, summary
```
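The IQR and z-score rules applied directly to a toy series with one planted gross outlier, to make the two criteria concrete:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(0, 1, 200))
s.iloc[0] = 12.0  # plant one gross outlier

# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
iqr_mask = (s < Q1 - 1.5 * IQR) | (s > Q3 + 1.5 * IQR)

# z-score rule: |z| > 3 (note: the outlier itself inflates mean and std)
z_mask = np.abs((s - s.mean()) / s.std()) > 3
```

Both rules flag the planted point; for heavily contaminated data the MAD rule above is preferable precisely because, unlike the z-score, its location and scale estimates are not inflated by the outliers themselves.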

## 5. Encoding — Selection Guide

| Method | Use case | Example |
|---|---|---|
| `LabelEncoder` | Ordinal categories, tree-based models | Crystal structure type → 0,1,2 |
| `pd.get_dummies()` | Nominal categories, linear models | Material name → dummy variables |
| `OrdinalEncoder` | Explicitly ordered categories | low/mid/high → 0,1,2 |
| `TargetEncoder` | High-cardinality categories | Compound IDs |

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

def encode_categorical(df, columns, method="auto"):
    """
    Encode categorical variables.

    method:
      "auto"    — choose based on nunique
      "label"   — LabelEncoder
      "onehot"  — pd.get_dummies()
      "ordinal" — OrdinalEncoder
    """
    df = df.copy()
    encoders = {}

    for col in columns:
        n_unique = df[col].nunique()

        if method == "auto":
            chosen = "onehot" if n_unique <= 10 else "label"
        else:
            chosen = method

        if chosen == "label":
            le = LabelEncoder()
            df[col] = le.fit_transform(df[col].astype(str))
            encoders[col] = le

        elif chosen == "onehot":
            dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
            df = pd.concat([df.drop(columns=[col]), dummies], axis=1)

        elif chosen == "ordinal":
            oe = OrdinalEncoder()
            df[col] = oe.fit_transform(df[[col]])
            encoders[col] = oe

    return df, encoders
```
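The two most common choices side by side on an invented material column, showing what each actually produces:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"material": ["ZnO", "TiO2", "SnO2", "ZnO"]})

# One-hot with drop_first: k categories -> k-1 dummy columns
dummies = pd.get_dummies(df["material"], prefix="material", drop_first=True)

# Label encoding: integer codes in alphabetical class order
codes = LabelEncoder().fit_transform(df["material"])
```

Note that `LabelEncoder` assigns codes by sorted class name (`SnO2`=0, `TiO2`=1, `ZnO`=2), which imposes an arbitrary order; that is harmless for tree models but misleading for linear ones, hence the table's split.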

## 6. Scaling and Transformation — Selection Guide

| Method | Formula | Use case |
|---|---|---|
| `StandardScaler` | $(x - \mu) / \sigma$ | Before PCA, normality assumed |
| `MinMaxScaler` | $(x - x_{min}) / (x_{max} - x_{min})$ | NN inputs, normalize to 0–1 |
| `RobustScaler` | $(x - Q_2) / (Q_3 - Q_1)$ | Robust to outliers |
| Pareto Scaling | $(x - \bar{x}) / \sqrt{s}$ | Metabolomics (Exp-07) |
| Log2 Transform | $\log_2(x + 1)$ | Count / expression data |
| Box-Cox | $(x^\lambda - 1)/\lambda$ | Improve normality (positive values only) |

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

def scale_features(df, columns, method="standard"):
    """
    Feature scaling.

    method:
      "standard" — StandardScaler (z-score)
      "minmax"   — MinMaxScaler (0-1)
      "robust"   — RobustScaler (IQR-based)
      "pareto"   — Pareto scaling (suited to metabolomics)
      "log2"     — log2(x + 1) transform
    """
    df = df.copy()
    scaler = None

    if method == "standard":
        scaler = StandardScaler()
        df[columns] = scaler.fit_transform(df[columns])

    elif method == "minmax":
        scaler = MinMaxScaler()
        df[columns] = scaler.fit_transform(df[columns])

    elif method == "robust":
        scaler = RobustScaler()
        df[columns] = scaler.fit_transform(df[columns])

    elif method == "pareto":
        # Pareto scaling: (x - mean) / sqrt(std)
        for col in columns:
            mean = df[col].mean()
            std = df[col].std()
            df[col] = (df[col] - mean) / np.sqrt(std + 1e-10)

    elif method == "log2":
        for col in columns:
            df[col] = np.log2(df[col] + 1)

    return df, scaler
```
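The table lists Box-Cox but the function above does not implement it; `scipy.stats.boxcox` fits the λ by maximum likelihood and handles it in one call. A sketch on invented right-skewed (log-normal) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=500)  # positive, right-skewed

# boxcox fits lambda by MLE when lmbda is not given
x_bc, lam = stats.boxcox(x)

skew_before = stats.skew(x)
skew_after = stats.skew(x_bc)  # should be much closer to 0
```

Remember the positivity requirement from the table: `boxcox` raises on zero or negative inputs, so shift or use the Yeo-Johnson variant (`sklearn.preprocessing.PowerTransformer`) for such columns.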

## 7. Integrated Pipeline Template

```python
def preprocessing_pipeline(df, target_col=None, config=None):
    """
    Full preprocessing pipeline, controlled by a config dict.

    Example config:
    {
        "drop_duplicates": True,
        "missing_strategy": "auto",
        "outlier_method": "iqr",
        "outlier_action": "clip",
        "encoding_method": "auto",
        "scaling_method": "standard",
    }
    """
    if config is None:
        config = {
            "drop_duplicates": True,
            "missing_strategy": "auto",
            "outlier_method": "iqr",
            "outlier_action": "clip",
            "encoding_method": "auto",
            "scaling_method": "standard",
        }

    n_original = len(df)

    # Step 0: remove duplicates
    if config.get("drop_duplicates", True):
        df = df.drop_duplicates()
        print(f"  Removed {n_original - len(df)} duplicate rows")

    # Step 1: classify columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
    if target_col and target_col in numeric_cols:
        numeric_cols.remove(target_col)
    if target_col and target_col in categorical_cols:
        categorical_cols.remove(target_col)

    # Step 2: missing values
    df = handle_missing_values(
        df,
        strategy=config.get("missing_strategy", "auto"),
        numeric_cols=numeric_cols,
        categorical_cols=categorical_cols,
    )

    # Step 3: outliers
    if numeric_cols:
        df, outlier_summary = handle_outliers(
            df, numeric_cols,
            method=config.get("outlier_method", "iqr"),
            action=config.get("outlier_action", "clip"),
        )

    # Step 4: encoding
    if categorical_cols:
        df, encoders = encode_categorical(
            df, categorical_cols,
            method=config.get("encoding_method", "auto"),
        )
    else:
        encoders = {}

    # Step 5: scaling
    final_numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    if target_col and target_col in final_numeric:
        final_numeric.remove(target_col)
    if final_numeric:
        df, scaler = scale_features(
            df, final_numeric,
            method=config.get("scaling_method", "standard"),
        )
    else:
        scaler = None

    print(f"  Preprocessing complete: {df.shape}")

    return df, {"encoders": encoders, "scaler": scaler}
```
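One caveat the template glosses over: in supervised settings, imputers and scalers should be fit on the training split only and then reused on the test split, otherwise test statistics leak into the fit. A minimal sketch on invented data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(10, 3, size=(200, 2))  # toy feature matrix

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit on the training split only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # reuse training mean/std
```

The same applies to the returned `encoders` and `scaler` dict above: persist them from the training run and call only `transform` on new data.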

## References

- **Exp-01**: `sc.pp.normalize_total()`, `sc.pp.log1p()` — scRNA-seq-specific normalization
- **Exp-02**: `LabelEncoder`, `pd.get_dummies()` — compound descriptor encoding
- **Exp-03**: `StandardScaler` + `log2(x+1)` — cancer data preprocessing
- **Exp-05**: `StandardScaler`, `KNNImputer` — toxicity prediction data
- **Exp-07**: Pareto scaling — metabolomics data
- **Exp-12**: `MinMaxScaler`, `RobustScaler` — process data
- **Exp-13**: `LabelEncoder`, dummy variables — material type encoding

@@ -0,0 +1,244 @@
---
name: scientific-data-simulation
description: |
  Skill for generating synthetic data grounded in physics, chemistry, and
  biology. Use when experimental data is scarce and simulated data reflecting
  domain knowledge is needed. Patterns established in Scientific Skills
  Exp-06, 07, 08, 09, 12, and 13.
---

# Scientific Data Simulation & Generation

A skill for generating realistic simulated data from physical laws, chemical
models, and biological knowledge. It compensates for scarce real data and
enables development and validation of ML pipelines.

## When to Use

- Experimental data has not yet been collected, or is insufficient
- Synthetic data is needed to prototype an ML pipeline
- Data must follow specific physical or chemical laws
- Building benchmark data with known, built-in causal relationships

## Quick Start

## Design Principles

1. **Physics-based models**: reflect the domain's causal relationships, not just simple normal distributions
2. **Added noise**: noise levels that mimic experimental uncertainty
3. **Material/group diversity**: generate multiple material types and conditions to reflect between-group differences
4. **Realistic ranges**: parameter ranges grounded in instrument and physical limits

## Data Generation Templates

### 1. Process Data Generation (Exp-12, 13 pattern)

```python
import numpy as np
import pandas as pd

def generate_process_dataset(n_samples=500, seed=42):
    """
    Template for generating physics-based process data.
    Causal chain: Process → Structure → Property.
    """
    rng = np.random.default_rng(seed)

    # === Process parameters (independent variables) ===
    temperature = rng.uniform(25, 500, n_samples)  # °C
    pressure = rng.uniform(0.1, 5.0, n_samples)    # Pa
    power = rng.uniform(50, 500, n_samples)        # W
    time = rng.uniform(5, 120, n_samples)          # min

    # === Structure variables (depend on Process) ===
    # Physical model: encode the causal relationships
    dep_rate = (
        0.5
        + 0.02 * power                   # power dependence
        - 0.005 * pressure ** 2          # saturation at high pressure
        + rng.normal(0, 0.5, n_samples)  # noise
    )
    dep_rate = np.clip(dep_rate, 0.1, 30)

    thickness = dep_rate * time
    thickness = np.clip(thickness, 5, 2000)

    crystallite_size = (
        5
        + 0.1 * temperature              # temperature dependence (Arrhenius-like)
        + 0.01 * time                    # time dependence
        + rng.normal(0, 2, n_samples)    # noise
    )
    crystallite_size = np.clip(crystallite_size, 2, 80)

    # === Property variables (depend on Structure) ===
    resistivity = (
        1e-2
        * np.exp(-0.005 * temperature)   # thermally activated
        * (1 + 0.01 * pressure)
        * np.exp(rng.normal(0, 0.3, n_samples))  # log-normal noise
    )

    transmittance = (
        95
        - 0.02 * thickness               # thickness dependence
        + 0.05 * crystallite_size        # crystallite-size dependence
        + rng.normal(0, 1, n_samples)
    )
    transmittance = np.clip(transmittance, 40, 98)

    df = pd.DataFrame({
        "Temperature": temperature,
        "Pressure": pressure,
        "Power": power,
        "Time": time,
        "Deposition_Rate": dep_rate,
        "Thickness": thickness,
        "Crystallite_Size": crystallite_size,
        "Resistivity": resistivity,
        "Transmittance": transmittance,
    })

    return df
```
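A quick way to verify that the built-in causality survives the added noise is to check correlation signs. A self-contained mini-version using the same structural form as `dep_rate` above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
power = rng.uniform(50, 500, n)
pressure = rng.uniform(0.1, 5.0, n)
# Same structural form as dep_rate above
dep_rate = 0.5 + 0.02 * power - 0.005 * pressure**2 + rng.normal(0, 0.5, n)

# The dominant dependence (power) should show a strong positive correlation
r_power = np.corrcoef(power, dep_rate)[0, 1]
```

The pressure term is deliberately weak here, so its sample correlation is near zero at this n; only the dominant dependence is worth asserting on.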

### 2. Multi-Material Data Generation (Exp-13 pattern)

```python
def generate_multi_material_dataset(materials, n_per_material=100, seed=42):
    """
    Generate PSP (Process–Structure–Property) data for multiple materials.
    materials: dict such as {"ZnO": {"Tm": 2248, "Eg": 3.3}, ...}
    """
    rng = np.random.default_rng(seed)
    all_data = []

    for mat_name, props in materials.items():
        n = n_per_material
        Tm = props["Tm"]  # melting point (K)
        Eg = props["Eg"]  # band gap (eV)

        # Process
        Tsub = rng.uniform(25, 500, n)
        Pwork = rng.uniform(0.1, 5.0, n)
        Power = rng.uniform(50, 500, n)

        # Structure (material-dependent causal relationship)
        T_homologous = (Tsub + 273.15) / Tm  # homologous temperature
        crystallite = 5 + 80 * T_homologous + rng.normal(0, 3, n)
        crystallite = np.clip(crystallite, 2, 80)

        # Property (material-specific value + process-dependent variation)
        bandgap = Eg + rng.normal(0, 0.1, n)

        data = pd.DataFrame({
            "Material": mat_name,
            "Substrate_Temp": Tsub,
            "Working_Pressure": Pwork,
            "Power": Power,
            "T_homologous": T_homologous,
            "Crystallite_Size": crystallite,
            "Bandgap": bandgap,
        })
        all_data.append(data)

    return pd.concat(all_data, ignore_index=True)
```
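The homologous temperature is what makes the structure response material-specific: the same substrate temperature sits at a different fraction of each material's melting point. A small check (the second material and its Tm are hypothetical, chosen only for illustration):

```python
# Melting points (K); ZnO value as in the example dict above, "MatX" is hypothetical
materials = {"ZnO": {"Tm": 2248}, "MatX": {"Tm": 2180}}

Tsub = 300.0  # substrate temperature in °C
T_h = {m: (Tsub + 273.15) / p["Tm"] for m, p in materials.items()}
```

The lower-melting material sees a higher homologous temperature and thus, in the model above, larger crystallites at the same process setting.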

### 3. Clinical Trial Data Generation (Exp-06 pattern)

```python
def generate_clinical_trial_data(n_total=500, effect_size=0.3, seed=42):
    """Generate simulated RCT data."""
    rng = np.random.default_rng(seed)
    n_per_arm = n_total // 2

    # Demographics
    ages = np.concatenate([
        rng.normal(55, 12, n_per_arm),
        rng.normal(55, 12, n_per_arm),
    ])
    sex = rng.choice(["M", "F"], n_total)
    group = np.array(["Treatment"] * n_per_arm + ["Control"] * n_per_arm)

    # Primary endpoint
    baseline = rng.normal(100, 15, n_total)
    treatment_effect = np.where(group == "Treatment", effect_size * 15, 0)
    endpoint = baseline + treatment_effect + rng.normal(0, 10, n_total)

    # Survival time (exponential)
    survival_time = rng.exponential(
        np.where(group == "Treatment", 365 * 2, 365 * 1.5),
        n_total
    )
    event = rng.binomial(1, 0.7, n_total)

    return pd.DataFrame({
        "Patient_ID": range(1, n_total + 1),
        "Group": group,
        "Age": ages.astype(int),
        "Sex": sex,
        "Baseline": baseline,
        "Endpoint": endpoint,
        "Survival_Time": survival_time,
        "Event": event,
    })
```
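A sanity check on the effect parameterization: with `effect_size=0.3` and an endpoint SD of 15, the planned shift is 4.5 units, and with enough patients the observed between-arm difference should land close to it. A self-contained mini-version (larger arms than the default, so the estimate is tight):

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_arm = 2000
effect = 0.3 * 15  # planned shift, same parameterization as above

# endpoint = baseline variation + treatment shift + residual noise
control = rng.normal(100, 15, n_per_arm) + rng.normal(0, 10, n_per_arm)
treatment = rng.normal(100, 15, n_per_arm) + effect + rng.normal(0, 10, n_per_arm)

observed_shift = treatment.mean() - control.mean()  # should be near 4.5
```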

### 4. Spectral Data Generation (Exp-08, 11 pattern)

```python
def generate_spectrum(wavenumbers, peak_positions, peak_heights,
                      peak_widths, noise_level=0.02, seed=None):
    """
    Generate a spectrum as a sum of Gaussian peaks.
    Generic across Raman / IR / UV-Vis, etc.
    """
    rng = np.random.default_rng(seed)
    spectrum = np.zeros_like(wavenumbers, dtype=float)

    for pos, height, width in zip(peak_positions, peak_heights, peak_widths):
        spectrum += height * np.exp(-0.5 * ((wavenumbers - pos) / width) ** 2)

    # Add noise
    spectrum += rng.normal(0, noise_level * spectrum.max(), len(wavenumbers))
    return spectrum


def generate_ecg_beat(t, hr=72):
    """Generate a synthetic ECG waveform (PQRST pattern)."""
    beat_duration = 60.0 / hr
    # Superposition of Gaussians for the P wave, QRS complex, and T wave
    p_wave = 0.1 * np.exp(-((t % beat_duration - 0.16) / 0.04) ** 2)
    qrs = 1.0 * np.exp(-((t % beat_duration - 0.25) / 0.01) ** 2)
    q_wave = -0.15 * np.exp(-((t % beat_duration - 0.22) / 0.015) ** 2)
    s_wave = -0.1 * np.exp(-((t % beat_duration - 0.28) / 0.015) ** 2)
    t_wave = 0.2 * np.exp(-((t % beat_duration - 0.40) / 0.05) ** 2)
    return p_wave + q_wave + qrs + s_wave + t_wave
```
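With the noise turned off, the Gaussian-sum construction is exactly verifiable: the spectrum's maximum must sit at the tallest peak's position. A self-contained check (peak positions and grid are invented for illustration):

```python
import numpy as np

def gaussian_spectrum(x, positions, heights, widths):
    # Noise-free sum of Gaussian peaks (same form as generate_spectrum)
    y = np.zeros_like(x, dtype=float)
    for p, h, w in zip(positions, heights, widths):
        y += h * np.exp(-0.5 * ((x - p) / w) ** 2)
    return y

wavenumbers = np.linspace(400, 1800, 1401)  # 1 cm^-1 grid
spec = gaussian_spectrum(wavenumbers, [520.0, 1350.0], [1.0, 0.6], [10.0, 25.0])

peak_idx = int(np.argmax(spec))  # global maximum should be the 1.0-height peak
```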

## Quality Checklist

Checks to validate generated data:

- [ ] Parameter ranges are physically realistic
- [ ] Causal directions are correct (e.g., temperature ↑ → crystallite size ↑)
- [ ] Noise level matches experimental repeatability
- [ ] Fraction of outliers is realistic (typically 1–5%)
- [ ] Correlation structure between variables matches known physical laws
- [ ] Between-group differences are detectable as effect sizes
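The range and correlation-sign items lend themselves to automation. A hypothetical helper (name and interface are illustrative, not part of the skill's API) that collects violated checks on a generated frame:

```python
import numpy as np
import pandas as pd

def check_generated_data(df, ranges, positive_corrs):
    """
    ranges: {column: (lo, hi)} physical bounds
    positive_corrs: [(x, y)] pairs expected to correlate positively
    Returns a list of violated checks (empty = all passed).
    """
    violations = []
    for col, (lo, hi) in ranges.items():
        if df[col].min() < lo or df[col].max() > hi:
            violations.append(f"range:{col}")
    for x, y in positive_corrs:
        if np.corrcoef(df[x], df[y])[0, 1] <= 0:
            violations.append(f"corr:{x}~{y}")
    return violations

# Toy frame with the expected temperature -> crystallite-size causality
rng = np.random.default_rng(0)
temp = rng.uniform(25, 500, 300)
df = pd.DataFrame({
    "Temperature": temp,
    "Crystallite_Size": np.clip(5 + 0.1 * temp + rng.normal(0, 2, 300), 2, 80),
})

issues = check_generated_data(
    df,
    ranges={"Crystallite_Size": (2, 80)},
    positive_corrs=[("Temperature", "Crystallite_Size")],
)
```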

## References

### Output Files

| File | Format |
|---|---|
| `data/<dataset_name>.csv` | CSV |

#### Reference Experiments

- **Exp-06**: clinical trial RCT simulation (500 patients, 2 arms)
- **Exp-07**: synthetic metabolomics data (100 samples × 200 metabolites)
- **Exp-08**: synthetic ECG/EEG signals (PQRST, band synthesis)
- **Exp-09**: genome sequences reflecting codon bias
- **Exp-12**: etching process data (500 samples × 8 parameters)
- **Exp-13**: thin-film deposition data (600 samples × 6 materials × 3 PSP levels)