@yibeichan/claude-skills 1.0.2

Files changed (40)
  1. package/LICENSE +21 -0
  2. package/README.md +98 -0
  3. package/cli.js +272 -0
  4. package/install.py +240 -0
  5. package/package.json +44 -0
  6. package/skills/bidsapp-nidm-standards/SKILL.md +202 -0
  7. package/skills/bidsapp-nidm-standards/references/babs_config.md +20 -0
  8. package/skills/bidsapp-nidm-standards/references/cli_arguments.md +76 -0
  9. package/skills/bidsapp-nidm-standards/references/container_patterns.md +53 -0
  10. package/skills/bidsapp-nidm-standards/references/nidm_integration.md +403 -0
  11. package/skills/bidsapp-nidm-standards/references/repo_structure.md +121 -0
  12. package/skills/bidsapp-nidm-standards/references/testing_patterns.md +82 -0
  13. package/skills/dicom2fmriprep/SKILL.md +377 -0
  14. package/skills/dicom2fmriprep/evals/evals.json +26 -0
  15. package/skills/dicom2fmriprep/references/babs-details.md +407 -0
  16. package/skills/dicom2fmriprep/references/fmriprep-details.md +250 -0
  17. package/skills/dicom2fmriprep/references/heudiconv-details.md +243 -0
  18. package/skills/fmri-ssm/SKILL.md +317 -0
  19. package/skills/fmri-ssm/references/code_templates.md +1570 -0
  20. package/skills/fmri-ssm/references/downstream_analysis.md +680 -0
  21. package/skills/fmri-ssm/references/group_inference.md +608 -0
  22. package/skills/fmri-ssm/references/hrf_modeling.md +447 -0
  23. package/skills/fmri-ssm/references/model_catalog.md +436 -0
  24. package/skills/fmri-ssm/references/paradigm_guide.md +406 -0
  25. package/skills/fmri-ssm/references/preprocessing.md +614 -0
  26. package/skills/fmri-ssm.zip +0 -0
  27. package/skills/neuroimaging-qc/SKILL.md +203 -0
  28. package/skills/neuroimaging-qc/references/eeg_qc.md +400 -0
  29. package/skills/neuroimaging-qc/references/fmri_qc.md +343 -0
  30. package/skills/neuroimaging-qc/references/fnirs_qc.md +430 -0
  31. package/skills/neuroimaging-qc/references/structural_qc.md +454 -0
  32. package/skills/neuroimaging-qc/scripts/parse_fmriprep_confounds.py +153 -0
  33. package/skills/neuroimaging-qc/scripts/parse_mriqc.py +114 -0
  34. package/skills/neuroimaging-qc/scripts/qc_report.py +295 -0
  35. package/skills/scientific-writer/SKILL.md +202 -0
  36. package/skills/scientific-writer/references/citation_styles.md +163 -0
  37. package/skills/scientific-writer/references/field_conventions.md +245 -0
  38. package/skills/scientific-writer/references/figures_tables.md +225 -0
  39. package/skills/scientific-writer/references/reporting_guidelines.md +225 -0
  40. package/skills.json +54 -0
@@ -0,0 +1,614 @@
1
+ # Preprocessing fMRI Data for State-Space Modeling
2
+
3
+ ## Table of Contents
4
+ 1. [fMRIPrep Output Structure](#fmriprep-outputs)
5
+ 2. [XCP-D Denoising for SSMs](#xcpd)
6
+ 3. [Confound Strategy](#confounds)
7
+ 4. [Dimensionality Reduction](#dim-reduction)
8
+ 5. [Parcellation](#parcellation)
9
+ 6. [CIFTI Surface-Based Processing](#cifti)
10
+ 7. [ICA-Based Approaches](#ica)
11
+ 8. [Temporal Filtering](#filtering)
12
+ 9. [Data Quality Checks Before SSM Fitting](#qc)
13
+ 10. [Preparing the Data Matrix](#data-matrix)
14
+
15
+ ---
16
+
17
+ ## 1. fMRIPrep Output Structure {#fmriprep-outputs}
18
+
19
+ fMRIPrep produces minimally preprocessed data with extensive metadata. Key outputs for SSMs:
20
+
21
+ **BOLD data (choose one):**
22
+ - `*_space-MNI152NLin6Asym_res-2_desc-preproc_bold.nii.gz` — volumetric, MNI space
23
+ - `*_space-fsLR_den-91k_bold.dtseries.nii` — CIFTI surface (preferred for surface analyses)
24
+ - `*_space-T1w_desc-preproc_bold.nii.gz` — native T1w space (for subject-specific parcellations)
25
+
26
+ **Confounds file:**
27
+ - `*_desc-confounds_timeseries.tsv` — all computed confounds (100+ columns)
28
+ - Use selectively — do NOT regress out everything
29
+
30
+ **Brain masks:**
31
+ - `*_space-MNI152NLin6Asym_res-2_desc-brain_mask.nii.gz`
32
+
33
+ **Transforms (for custom parcellation):**
34
+ - `*_from-MNI152NLin6Asym_to-T1w_mode-image_xfm.h5` (and reverse)
35
+
36
+ ---
37
+
38
+ ## 2. XCP-D Denoising for SSMs {#xcpd}
39
+
40
+ XCP-D applies denoising strategies to fMRIPrep outputs. For SSM analyses, the key choices are:
41
+
42
+ **Recommended pipeline:** `36P` or `acompcor` denoising strategy
43
+
44
+ **36P strategy:**
45
+ - 6 motion parameters + their temporal derivatives + quadratic terms (24 motion regressors)
46
+ - Mean WM signal + derivative + quadratic (4 regressors)
47
+ - Mean CSF signal + derivative + quadratic (4 regressors)
48
+ - Mean global signal + derivative + quadratic (4 regressors) — CONTROVERSIAL, see below
49
+
50
+ **aCompCor strategy (alternative):**
51
+ - Top 5 aCompCor components from WM + CSF
52
+ - 6 motion parameters + temporal derivatives
53
+ - Avoids global signal regression
54
+
55
+ **Global signal regression (GSR) — the controversy for SSMs:**
56
+ GSR removes variance shared across all brain regions. This:
57
+ - Removes global arousal/drowsiness fluctuations (often desirable for resting-state SSMs)
58
+ - But introduces mathematical anticorrelations between regions
59
+ - These anticorrelations can create artifactual "anticorrelated" states in HMMs
60
+ - **Recommendation:** Run analyses both with and without GSR. If your states change
61
+ dramatically, the GSR-sensitive states may be artifacts.
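One way to quantify "change dramatically" is to compare the decoded state sequences from the two fits. The sketch below is illustrative only: `state_agreement` is a hypothetical helper, not part of any library, and it assumes both fits use the same number of states and that K is small enough to try every relabeling.

```python
from itertools import permutations

import numpy as np


def state_agreement(states_a, states_b, n_states):
    """Best-case fraction of TRs with matching labels after relabeling states_b.

    State labels are arbitrary across fits, so every permutation of the
    labels is tried (fine for small K; use an assignment solver for large K).
    """
    states_a = np.asarray(states_a)
    states_b = np.asarray(states_b)
    # overlap[i, j] = number of TRs labeled i in fit A and j in fit B
    overlap = np.zeros((n_states, n_states), dtype=int)
    for a, b in zip(states_a, states_b):
        overlap[a, b] += 1
    best = max(
        sum(overlap[i, perm[i]] for i in range(n_states))
        for perm in permutations(range(n_states))
    )
    return best / len(states_a)


# Toy example: the second sequence is the first with labels swapped
with_gsr = [0, 0, 1, 1, 2, 2, 0, 1]
without_gsr = [2, 2, 0, 0, 1, 1, 2, 0]
print(state_agreement(with_gsr, without_gsr, n_states=3))
```

Low agreement after optimal relabeling suggests the state structure depends on GSR; what counts as "low" is a judgment call for your dataset.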
62
+
63
+ **XCP-D output for SSMs:**
64
+ ```
65
+ xcp_d/sub-01/func/
66
+ sub-01_task-rest_space-MNI152NLin6Asym_res-2_desc-denoised_bold.nii.gz
67
+ sub-01_task-rest_space-fsLR_den-91k_desc-denoised_bold.dtseries.nii
68
+ sub-01_task-rest_desc-confounds_timeseries.tsv # residual confounds
69
+ ```
70
+
71
+ ---
72
+
73
+ ## 3. Confound Strategy {#confounds}
74
+
75
+ ### Minimum recommended confounds (if not using XCP-D)
76
+
77
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ def load_confounds_for_ssm(confounds_file, strategy='moderate'):
+     """Load fMRIPrep confounds appropriate for SSM analysis.
+
+     Parameters
+     ----------
+     confounds_file : str
+         Path to *_desc-confounds_timeseries.tsv
+     strategy : str
+         'minimal': 6 motion + WM + CSF (8 regressors)
+         'moderate': 24 motion + aCompCor top 5 (~29 regressors)
+         'aggressive': 36P (36 regressors, includes GSR)
+     """
+     df = pd.read_csv(confounds_file, sep='\t')
+
+     # Motion parameters (always include)
+     motion_cols = ['trans_x', 'trans_y', 'trans_z', 'rot_x', 'rot_y', 'rot_z']
+
+     if strategy == 'minimal':
+         confound_cols = motion_cols + ['csf', 'white_matter']
+
+     elif strategy == 'moderate':
+         motion_derivs = [f'{c}_derivative1' for c in motion_cols]
+         motion_power2 = [f'{c}_power2' for c in motion_cols]
+         motion_deriv_power2 = [f'{c}_derivative1_power2' for c in motion_cols]
+         acompcor = [c for c in df.columns if c.startswith('a_comp_cor_')][:5]
+         confound_cols = motion_cols + motion_derivs + motion_power2 + motion_deriv_power2 + acompcor
+
+     elif strategy == 'aggressive':
+         # 36P: (6 motion + WM + CSF + global signal) x 4 expansions.
+         # Built explicitly so prefix matches such as 'csf_wm' are not picked up.
+         base = motion_cols + ['csf', 'white_matter', 'global_signal']
+         expansions = ['', '_derivative1', '_power2', '_derivative1_power2']
+         confound_cols = [f'{c}{s}' for c in base for s in expansions]
+
+     else:
+         raise ValueError(f"Unknown strategy: {strategy!r}")
+
+     confounds = df[confound_cols].values
+     # Derivative columns are NaN in the first row; replace with 0
+     confounds = np.nan_to_num(confounds, nan=0.0)
+
+     return confounds, confound_cols
+ ```
119
+
120
+ ### Motion scrubbing / censoring
121
+
122
+ High-motion time points can create artifactual states. Two approaches:
123
+
124
+ **Approach A: Scrub before fitting (recommended)**
125
+ Remove high-motion TRs (framewise displacement > 0.5mm) and their neighbors. For the
126
+ remaining gaps, use this strategy based on gap size:
127
+
128
+ - **Short gaps (1–2 consecutive censored TRs):** Linearly interpolate across the gap so
129
+ the HMM sees a continuous sequence without an abrupt discontinuity. Interpolated TRs
130
+ do not contribute real dynamics but prevent boundary artifacts.
131
+ - **Longer gaps (≥3 consecutive censored TRs):** Treat as a run boundary — pass the gap
132
+ endpoints as separate segments in the `lengths` array. Do NOT interpolate across long
133
+ gaps; the interpolated signal would be fabricated.
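The gap-size rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: `split_censored_run` and `max_interp_gap` are hypothetical names, and censored TRs at the very start or end of a run are simply dropped because there is nothing to interpolate from.

```python
import numpy as np


def split_censored_run(ts, keep_mask, max_interp_gap=2):
    """Interpolate short censored gaps and split the run at long ones.

    ts : array (T, n_features); keep_mask : bool array (T,), True = keep.
    Returns a list of arrays, one per contiguous segment.
    """
    ts = np.asarray(ts, dtype=float).copy()
    keep = np.asarray(keep_mask, dtype=bool).copy()
    T = len(keep)

    # Find runs of consecutive censored TRs as (start, stop) pairs
    gaps, t = [], 0
    while t < T:
        if not keep[t]:
            start = t
            while t < T and not keep[t]:
                t += 1
            gaps.append((start, t))
        else:
            t += 1

    for start, stop in gaps:
        interior = start > 0 and stop < T  # kept data on both sides
        if interior and (stop - start) <= max_interp_gap:
            # Linear interpolation between the flanking kept TRs
            w = (np.arange(start, stop) - (start - 1)) / (stop - start + 1)
            ts[start:stop] = (1 - w)[:, None] * ts[start - 1] + w[:, None] * ts[stop]
            keep[start:stop] = True

    # Remaining censored TRs act as segment boundaries
    segments, t = [], 0
    while t < T:
        if keep[t]:
            start = t
            while t < T and keep[t]:
                t += 1
            segments.append(ts[start:t])
        else:
            t += 1
    return segments


# Toy run: one 1-TR gap (interpolated) and one 3-TR gap (treated as a boundary)
ts = np.arange(10, dtype=float)[:, None]
keep = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1, 1], dtype=bool)
segments = split_censored_run(ts, keep)
lengths = [len(s) for s in segments]
print(lengths)
```

The resulting `lengths` list, together with `np.vstack(segments)`, is in the form that hmmlearn's `fit(X, lengths=...)` accepts.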
134
+
135
+ **Approach B: Flag and verify after fitting**
136
+ Fit the SSM on all data, then check if any states correlate with framewise displacement.
137
+ If a state's occupancy time course correlates with FD at r > 0.3, the state is likely motion-driven.
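A minimal sketch of the Approach B check, assuming a decoded state sequence and the FD trace are already loaded as arrays. The helper name `state_fd_correlation` is hypothetical, and the 0.3 cutoff is the heuristic from the text rather than a validated threshold.

```python
import numpy as np


def state_fd_correlation(states, fd, n_states):
    """Pearson r between each state's occupancy (0/1 per TR) and FD."""
    states = np.asarray(states)
    # The first-TR FD is NaN in fMRIPrep output; treat it as 0
    fd = np.nan_to_num(np.asarray(fd, dtype=float))
    r = np.full(n_states, np.nan)
    for k in range(n_states):
        occupancy = (states == k).astype(float)
        if occupancy.std() > 0 and fd.std() > 0:
            r[k] = np.corrcoef(occupancy, fd)[0, 1]
    return r


# Toy example: state 1 is visited only during the high-motion TRs
states = np.array([0, 0, 0, 1, 1, 0, 0, 2, 2, 0])
fd = np.array([np.nan, 0.1, 0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1, 0.1])
r = state_fd_correlation(states, fd, n_states=3)
flagged = [k for k in range(3) if r[k] > 0.3]
print(flagged)
```

States that are never visited get `nan` and are not flagged.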
138
+
139
+ ```python
+ import pandas as pd
+
+ def identify_high_motion_trs(confounds_file, fd_threshold=0.5, n_before=0, n_after=2):
+     """Identify TRs to censor due to high motion.
+
+     Returns boolean mask: True = keep, False = censor.
+     """
+     df = pd.read_csv(confounds_file, sep='\t')
+     # First TR has no FD (NaN in the confounds file); treat it as 0
+     fd = df['framewise_displacement'].fillna(0).to_numpy()
+
+     censor = fd > fd_threshold
+
+     # Expand censoring to neighbors
+     censor_expanded = censor.copy()
+     for i in range(len(censor)):
+         if censor[i]:
+             start = max(0, i - n_before)
+             end = min(len(censor), i + n_after + 1)
+             censor_expanded[start:end] = True
+
+     keep_mask = ~censor_expanded
+     pct_removed = 100 * censor_expanded.sum() / len(censor_expanded)
+     print(f"Censoring {censor_expanded.sum()}/{len(censor)} TRs ({pct_removed:.1f}%)")
+
+     if pct_removed > 25:
+         print("WARNING: >25% of data censored. Consider excluding this run.")
+
+     return keep_mask
+ ```
168
+
169
+ ---
170
+
171
+ ## 4. Dimensionality Reduction {#dim-reduction}
172
+
173
+ ### When to reduce dimensions
174
+ - ROI timeseries from fine parcellations (>100 ROIs): full-covariance HMMs may need reduction
175
+ - ICA with many components (>50): consider selecting or reducing
176
+ - CIFTI / voxel-level data: always reduce before SSM fitting
177
+ - Rule of thumb: n_features should be at most T / (10 × K) for stable full-covariance estimation
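As a worked example of the rule of thumb, the arithmetic can be wrapped in a small helper. The function name `max_features_full_cov` and the example numbers are illustrative assumptions.

```python
def max_features_full_cov(n_timepoints, n_states, ratio=10):
    """Upper bound on n_features from the T / (10 x K) rule of thumb."""
    return n_timepoints // (ratio * n_states)


# Example: 4 runs x 300 TRs, K = 8 states
T, K = 4 * 300, 8
print(max_features_full_cov(T, K))  # 1200 // 80 = 15
```

With 1200 TRs and 8 states, a 200-ROI parcellation would therefore need reduction to roughly 15 dimensions before fitting a full-covariance model.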
178
+
179
+ ### PCA
180
+
181
+ ```python
+ from sklearn.decomposition import PCA
+
+ def reduce_dimensions_pca(bold_data, n_components=None, variance_explained=0.95):
+     """PCA dimensionality reduction for SSM input.
+
+     Parameters
+     ----------
+     bold_data : array, shape (T, n_features)
+     n_components : int or None
+         Fixed number of components. If None, use variance_explained.
+     variance_explained : float
+         Target cumulative variance explained (used if n_components is None)
+     """
+     if n_components is not None:
+         pca = PCA(n_components=n_components)
+     else:
+         # A float in (0, 1) tells scikit-learn to keep enough components
+         # to reach that fraction of explained variance
+         pca = PCA(n_components=variance_explained)
+
+     reduced = pca.fit_transform(bold_data)
+     print(f"Reduced {bold_data.shape[1]} features to {reduced.shape[1]} components")
+     print(f"Cumulative variance explained: {pca.explained_variance_ratio_.sum():.3f}")
+
+     return reduced, pca
+ ```
206
+
207
+ ### Recommended preprocessing order before dimensionality reduction
208
+
209
+ Apply steps in this order to avoid introducing artifacts:
210
+ 1. **Confound regression** — regress out motion, WM/CSF signals, aCompCor components
211
+ 2. **Z-score per region** — zero mean and unit variance across time (per ROI)
212
+ 3. **PCA or ICA** — after z-scoring, so PCA components reflect variance structure, not scale
213
+
214
+ Reversing steps 2 and 3 (PCA before z-scoring) can bias components toward high-variance
215
+ regions (e.g., large subcortical structures), not the most informative regions.
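The effect of the ordering can be seen on synthetic data. The sketch below uses plain NumPy SVD in place of scikit-learn's PCA, with made-up independent "regions"; it is an illustration of the scale bias, not a preprocessing recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))  # 500 TRs, 5 independent "regions"
X[:, 0] *= 10                      # region 0 has 100x the variance


def first_pc(data):
    """Return (loadings, explained-variance ratio) of the first principal component."""
    centered = data - data.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0], s[0] ** 2 / np.sum(s ** 2)


raw_loadings, raw_ratio = first_pc(X)
Z = (X - X.mean(axis=0)) / X.std(axis=0)
z_loadings, z_ratio = first_pc(Z)

print(f"raw: |loading on region 0| = {abs(raw_loadings[0]):.2f}, PC1 variance ratio = {raw_ratio:.2f}")
print(f"z-scored: PC1 variance ratio = {z_ratio:.2f}")
```

Without z-scoring, the first component is almost entirely the high-variance region; after z-scoring, no single region dominates.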
216
+
217
+ ### Which dimensionality reduction for which model?
218
+
219
+ | Model | Recommended approach | Typical n_components |
220
+ |-------|---------------------|---------------------|
221
+ | Gaussian HMM (full cov) | PCA or parcellation | 15-50 |
222
+ | Gaussian HMM (diag cov) | Parcellation alone is fine | 50-400 |
223
+ | HMM-MAR | ICA or PCA (mandatory) | 15-25 |
224
+ | SLDS/rSLDS | PCA or parcellation | 20-50 (observation), 5-15 (latent) |
225
+
226
+ ---
227
+
228
+ ## 5. Parcellation {#parcellation}
229
+
230
+ ### Common parcellation atlases
231
+
232
+ | Atlas | Resolutions | Space | Notes |
233
+ |-------|------------|-------|-------|
234
+ | Schaefer | 100, 200, 300, 400, 500, 600, 800, 1000 | MNI, fsLR | Most popular for HMMs. Comes with 7- and 17-network labels. |
235
+ | Gordon | 333 parcels | MNI, fsLR | Good community detection-based parcellation |
236
+ | Glasser (HCP-MMP) | 360 parcels | fsLR (surface) | Multimodal parcellation, surface-based |
237
+ | AAL | 90/116 regions | MNI | Older, anatomical. Still used but Schaefer preferred. |
238
+ | Harvard-Oxford | 48/96 regions | MNI | Probabilistic, anatomical |
239
+ | Tian (subcortical) | 16/32/50 scales | MNI | Pair with Schaefer for subcortical coverage |
240
+
241
+ ### Parcellating with nilearn
242
+
243
+ ```python
+ from nilearn import datasets, maskers
+ import numpy as np
+
+ def parcellate_bold(bold_file, atlas='schaefer', n_rois=200, tr=2.0,
+                     confounds=None, standardize='zscore_sample'):
+     """Extract parcellated timeseries from BOLD data.
+
+     Parameters
+     ----------
+     bold_file : str
+         Path to preprocessed BOLD NIfTI file
+     atlas : str
+         'schaefer' or 'gordon' (other atlases need their own branch below)
+     n_rois : int
+         Number of ROIs (for Schaefer)
+     tr : float
+         Repetition time in seconds. Required when high_pass is set, because
+         nilearn needs the sampling rate to interpret the cutoff given in Hz.
+     confounds : array or None
+         Confound matrix to regress out during extraction
+     standardize : str
+         'zscore_sample' recommended for SSMs (zero mean, unit variance per region)
+     """
+     if atlas == 'schaefer':
+         atlas_data = datasets.fetch_atlas_schaefer_2018(
+             n_rois=n_rois, resolution_mm=2
+         )
+         labels_img = atlas_data.maps
+     elif atlas == 'gordon':
+         # NOTE: check that this fetcher exists in your nilearn version;
+         # otherwise set labels_img to the path of a local Gordon atlas NIfTI.
+         atlas_data = datasets.fetch_atlas_gordon_2016()
+         labels_img = atlas_data.maps
+     else:
+         raise ValueError(f"Unsupported atlas: {atlas!r}")
+
+     masker = maskers.NiftiLabelsMasker(
+         labels_img=labels_img,
+         standardize=standardize,
+         detrend=True,
+         high_pass=0.01,  # Remove very slow drift (requires t_r to be set)
+         t_r=tr,          # Needed to interpret high_pass; do not leave as None when filtering
+         memory='nilearn_cache',
+     )
+
+     timeseries = masker.fit_transform(bold_file, confounds=confounds)
+     print(f"Extracted timeseries: {timeseries.shape} (TRs × ROIs)")
+
+     return timeseries, masker
+
+ def add_subcortical(cortical_ts, bold_file, confounds=None):
+     """Add subcortical ROIs (Tian atlas) to cortical parcellation."""
+     # Tian subcortical atlas, 16-parcel scale.
+     # NOTE: check that this fetcher exists in your nilearn version;
+     # otherwise pass the path of a local Tian atlas NIfTI as labels_img.
+     tian = datasets.fetch_atlas_tian_2020(resolution=2)
+     masker_sub = maskers.NiftiLabelsMasker(
+         labels_img=tian.maps,
+         standardize='zscore_sample',
+         detrend=True,
+     )
+     subcort_ts = masker_sub.fit_transform(bold_file, confounds=confounds)
+     combined = np.hstack([cortical_ts, subcort_ts])
+     print(f"Combined: {combined.shape} ({cortical_ts.shape[1]} cortical + {subcort_ts.shape[1]} subcortical)")
+     return combined
+ ```
304
+
305
+ ---
306
+
307
+ ## 6. CIFTI Surface-Based Processing {#cifti}
308
+
309
+ CIFTI files (.dtseries.nii) contain surface vertices (L/R cortex) + subcortical voxels in a
310
+ single file. This is the preferred format for HCP-style analyses and preserves cortical
311
+ topology better than volumetric approaches.
312
+
313
+ ### Loading CIFTI data
314
+
315
+ ```python
+ import nibabel as nib
+ import numpy as np
+
+ def load_cifti_timeseries(cifti_file):
+     """Load a CIFTI dtseries file and return timeseries + metadata."""
+     img = nib.load(cifti_file)
+     data = img.get_fdata()  # shape: (T, n_greyordinates)
+
+     # Get brain model information
+     axes = [img.header.get_axis(i) for i in range(img.ndim)]
+     brain_axis = axes[1]  # BrainModelAxis
+
+     # Identify structures
+     structures = {}
+     for name, indices, model in brain_axis.iter_structures():
+         structures[str(name)] = {
+             'indices': indices,
+             'n_vertices': len(range(indices.start, indices.stop)),
+         }
+
+     print(f"CIFTI shape: {data.shape}")
+     for name, info in structures.items():
+         print(f"  {name}: {info['n_vertices']} greyordinates")
+
+     return data, img, structures
+
+ def parcellate_cifti(cifti_file, dlabel_file):
+     """Parcellate CIFTI timeseries using a dlabel parcellation.
+
+     Parameters
+     ----------
+     cifti_file : str
+         Path to .dtseries.nii
+     dlabel_file : str
+         Path to .dlabel.nii parcellation (e.g., Schaefer on fsLR)
+
+     Returns
+     -------
+     parcellated : array, shape (T, n_parcels)
+     """
+     bold_img = nib.load(cifti_file)
+     bold_data = bold_img.get_fdata()  # (T, n_greyordinates)
+
+     label_img = nib.load(dlabel_file)
+     labels = label_img.get_fdata().squeeze()  # (n_greyordinates,)
+
+     unique_labels = np.unique(labels)
+     unique_labels = unique_labels[unique_labels > 0]  # remove background
+
+     parcellated = np.zeros((bold_data.shape[0], len(unique_labels)))
+     for i, label in enumerate(unique_labels):
+         mask = labels == label
+         parcellated[:, i] = bold_data[:, mask].mean(axis=1)
+
+     # Z-score each parcel
+     parcellated = (parcellated - parcellated.mean(axis=0)) / parcellated.std(axis=0)
+
+     print(f"Parcellated CIFTI: {parcellated.shape}")
+     return parcellated
+ ```
376
+
377
+ ### Workbench command-line tools for CIFTI
378
+
379
+ ```bash
380
+ # Parcellate CIFTI using wb_command (fast, handles structures correctly)
381
+ wb_command -cifti-parcellate \
382
+ sub-01_bold.dtseries.nii \
383
+ Schaefer2018_200Parcels_17Networks.dlabel.nii \
384
+ COLUMN \
385
+ sub-01_bold_parcellated.ptseries.nii
386
+
387
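+ # Smooth on the surface before parcellation (recommended: 4-6 mm FWHM).
+ # Note: the two kernel values are read as Gaussian sigma in mm unless -fwhm is given
+ # (FWHM ≈ 2.355 × sigma); check `wb_command -cifti-smoothing` help for your version.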
388
+ wb_command -cifti-smoothing \
389
+ sub-01_bold.dtseries.nii \
390
+ 4 4 COLUMN \
391
+ sub-01_bold_smoothed.dtseries.nii \
392
+ -left-surface sub-01.L.midthickness.32k_fs_LR.surf.gii \
393
+ -right-surface sub-01.R.midthickness.32k_fs_LR.surf.gii
394
+ ```
395
+
396
+ ---
397
+
398
+ ## 7. ICA-Based Approaches {#ica}
399
+
400
+ ### When to use ICA instead of parcellation
401
+ - When you want data-driven spatial features (not constrained by atlas boundaries)
402
+ - When you expect the relevant signals to be spatially distributed/overlapping
403
+ - When using HMM-MAR (ICA components are the standard input)
404
+ - When the number of meaningful dimensions is unknown
405
+
406
+ ### Group ICA with FSL MELODIC
407
+
408
+ ```bash
409
+ # Run group ICA (typically 15-50 components for SSM input)
410
+ melodic -i bold_files_list.txt \
411
+ -o group_ica_output \
412
+ --dim=25 \
413
+ --tr=0.8 \
414
+ --Oall \
415
+ --report
416
+ ```
417
+
418
+ ### Extracting subject-level ICA timeseries (dual regression)
419
+
420
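+ ```bash
+ # Dual regression: project group ICA maps onto individual data.
+ # Argument order (check `dual_regression` usage for your FSL version):
+ #   <group_IC_maps> <des_norm> <design> <n_perm> <output_dir> <input files...>
+ #   1  = variance-normalize the timeseries
+ #   -1 = no design matrix/contrasts (group-mean model)
+ #   0  = skip permutation testing
+ # Comments cannot follow a trailing backslash, so they are listed here instead.
+ dual_regression \
+   group_ica_output/melodic_IC.nii.gz \
+   1 \
+   -1 \
+   0 \
+   output_dir \
+   $(cat bold_files_list.txt)
+ ```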
429
+
430
+ ### Using nilearn for ICA
431
+
432
+ ```python
+ from nilearn.decomposition import CanICA
+
+ def run_group_ica(bold_files, n_components=25, random_state=42):
+     """Run group ICA on multiple subjects using nilearn."""
+     canica = CanICA(
+         n_components=n_components,
+         memory='nilearn_cache',
+         memory_level=2,
+         threshold=3.,
+         n_init=10,
+         random_state=random_state,
+     )
+     canica.fit(bold_files)
+
+     # Extract timeseries for each subject
+     all_timeseries = []
+     for bold_file in bold_files:
+         ts = canica.transform([bold_file])[0]  # (T, n_components)
+         all_timeseries.append(ts)
+
+     return all_timeseries, canica
+ ```
455
+
456
+ ---
457
+
458
+ ## 8. Temporal Filtering {#filtering}
459
+
460
+ ### High-pass filtering
461
+ fMRIPrep writes cosine drift regressors (`cosine_XX` columns, default 128 s cutoff) to the
+ confounds file but does not apply them to the BOLD data. XCP-D may apply additional
+ filtering. For SSMs, slow drift removal is important: otherwise, slow drift can be
+ mistaken for a "state."
464
+
465
+ **Recommended:** High-pass filter at 0.01 Hz (100s period) or use cosine regressors.
466
+ Do NOT use aggressive high-pass (>0.03 Hz) as this can remove real slow state dynamics.
467
+
468
+ ### Low-pass filtering
469
+ Generally NOT recommended for SSMs. Low-pass filtering removes the high-frequency information
470
+ that distinguishes states. Exception: if you have very fast TR (<0.5s) and want to remove
471
+ cardiac/respiratory aliasing, a gentle low-pass at 0.2 Hz may help.
472
+
473
+ ### Band-pass filtering
474
+ Some resting-state analyses use 0.01-0.1 Hz bandpass. This is standard for FC analyses
475
+ but overly aggressive for SSMs — it removes fast transitions that SSMs are designed to detect.
476
+ **Recommendation:** Use 0.01 Hz high-pass only, no low-pass, unless you have specific reasons.
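For readers who filter outside nilearn or XCP-D, a high-pass can be applied by regressing out a low-frequency cosine basis. The sketch below is one way to do this under stated assumptions: the helper names and the DCT-II convention are illustrative choices, and the exact number of regressors differs slightly between packages.

```python
import numpy as np


def cosine_drift_basis(n_trs, tr, cutoff_s=100.0):
    """DCT-II regressors with frequencies at or below 1/cutoff_s Hz, plus a constant."""
    n = np.arange(n_trs)
    # Basis k has frequency k / (2 * n_trs * tr) Hz; keep those up to the cutoff
    n_basis = int(np.floor(2 * n_trs * tr / cutoff_s))
    cols = [np.ones(n_trs)]
    for k in range(1, n_basis + 1):
        cols.append(np.cos(np.pi * k * (2 * n + 1) / (2 * n_trs)))
    return np.column_stack(cols)


def highpass_by_regression(ts, tr, cutoff_s=100.0):
    """Remove slow drift from ts (T, n_features) by regressing out the cosine basis."""
    basis = cosine_drift_basis(ts.shape[0], tr, cutoff_s)
    beta, *_ = np.linalg.lstsq(basis, ts, rcond=None)
    return ts - basis @ beta


# Toy signal: slow linear drift plus a 0.1 Hz fluctuation that should survive
tr, T = 1.0, 300
t = np.arange(T) * tr
drift = 0.02 * t
fast = np.sin(2 * np.pi * 0.1 * t)
filtered = highpass_by_regression((drift + fast)[:, None], tr)
print(f"std before: {(drift + fast).std():.2f}, after: {filtered.std():.2f}")
```

Because no low-pass is involved, fluctuations faster than the cutoff pass through essentially unchanged, which matches the recommendation above.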
477
+
478
+ ---
479
+
480
+ ## 9. Data Quality Checks Before SSM Fitting {#qc}
481
+
482
+ Run these checks BEFORE fitting any SSM:
483
+
484
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ def pre_ssm_quality_checks(timeseries, confounds_file, tr):
+     """Quality checks for SSM input data.
+
+     Parameters
+     ----------
+     timeseries : array, shape (T, n_features)
+     confounds_file : str
+     tr : float
+     """
+     T, p = timeseries.shape
+     df = pd.read_csv(confounds_file, sep='\t')
+
+     # 1. Check for NaN/Inf
+     n_nan = np.isnan(timeseries).sum()
+     n_inf = np.isinf(timeseries).sum()
+     print(f"NaN values: {n_nan}, Inf values: {n_inf}")
+     assert n_nan == 0 and n_inf == 0, "Data contains NaN or Inf; fix preprocessing"
+
+     # 2. Check temporal SNR per region. Only meaningful for non-standardized
+     #    input: z-scored timeseries have mean ~0, so tSNR would be ~0 everywhere.
+     mean, std = timeseries.mean(axis=0), timeseries.std(axis=0)
+     safe_std = np.where(std > 0, std, np.nan)  # avoid division by zero
+     if np.allclose(mean, 0, atol=1e-6):
+         tsnr = None
+         print("tSNR check skipped: input looks demeaned or z-scored")
+     else:
+         tsnr = mean / safe_std
+         low_tsnr = (tsnr < 20).sum()
+         print(f"Regions with tSNR < 20: {low_tsnr}/{p}")
+         if low_tsnr > p * 0.1:
+             print("WARNING: >10% of regions have low tSNR")
+
+     # 3. Check motion (first-TR FD is NaN in the confounds file; treat as 0)
+     fd = df['framewise_displacement'].fillna(0).to_numpy()
+     mean_fd = fd.mean()
+     pct_high = 100 * (fd > 0.5).sum() / len(fd)
+     print(f"Mean FD: {mean_fd:.3f} mm, TRs with FD>0.5mm: {pct_high:.1f}%")
+     if mean_fd > 0.3:
+         print("WARNING: High mean motion; consider excluding")
+
+     # 4. Check variance across regions (detect dead regions)
+     region_var = timeseries.var(axis=0)
+     dead_regions = (region_var < 1e-6).sum()
+     print(f"Dead regions (near-zero variance): {dead_regions}")
+
+     # 5. Check for extreme outliers
+     z_scores = np.abs((timeseries - mean) / safe_std)
+     extreme_trs = (z_scores > 5).any(axis=1).sum()
+     print(f"TRs with extreme outliers (|z|>5): {extreme_trs}/{T}")
+
+     # 6. Scan length adequacy (rough heuristic: ~50 TRs per state)
+     print(f"Total TRs: {T} ({T*tr/60:.1f} minutes)")
+     print(f"Rough max K for stable estimation (full cov): ~{T // 50}")
+
+     return {
+         'tsnr': tsnr, 'mean_fd': mean_fd, 'pct_high_motion': pct_high,
+         'dead_regions': dead_regions, 'extreme_trs': extreme_trs,
+     }
+ ```
541
+
542
+ ---
543
+
544
+ ## 10. Preparing the Data Matrix {#data-matrix}
545
+
546
+ ### Final assembly for SSM fitting
547
+
548
+ ```python
+ def prepare_ssm_input(bold_files, confounds_files, parcellation='schaefer200',
+                       standardize=True, concat_runs=True, tr=None):
+     """Full pipeline from fMRIPrep outputs to SSM-ready data matrix.
+
+     `tr` (seconds) must be provided; it is used for filtering and QC.
+     `parcellation` is currently unused: Schaefer-200 is hard-coded below.
+
+     Returns
+     -------
+     data : array or list of arrays
+         If concat_runs: single array (total_T, n_features) with run_boundaries
+         If not: list of arrays, one per run
+     run_boundaries : list of int
+         TR indices where runs start (for resetting HMM forward algorithm)
+     """
+     if tr is None:
+         raise ValueError("tr (repetition time in seconds) is required")
+
+     all_runs = []
+     run_boundaries = [0]
+
+     for bold_file, confounds_file in zip(bold_files, confounds_files):
+         # 1. Load confounds
+         confounds, _ = load_confounds_for_ssm(confounds_file, strategy='moderate')
+
+         # 2. Parcellate (with confound regression built in); pass the real TR
+         #    so the high-pass filter is not computed with the default of 2.0 s
+         ts, masker = parcellate_bold(
+             bold_file, atlas='schaefer', n_rois=200, tr=tr,
+             confounds=confounds, standardize='zscore_sample'
+         )
+
+         # 3. Quality check
+         qc = pre_ssm_quality_checks(ts, confounds_file, tr)
+
+         all_runs.append(ts)
+         run_boundaries.append(run_boundaries[-1] + ts.shape[0])
+
+     if concat_runs:
+         data = np.vstack(all_runs)
+         print(f"Concatenated: {data.shape}")
+         if standardize:
+             data = (data - data.mean(axis=0)) / data.std(axis=0)
+         return data, run_boundaries[:-1]  # exclude last boundary
+     else:
+         return all_runs, run_boundaries[:-1]
+ ```
589
+
590
+ ### Handling run boundaries in HMMs
591
+
592
+ When concatenating runs, you MUST tell the HMM where run boundaries are. Otherwise it will
593
+ try to model transitions from the end of run N to the start of run N+1, which are not
594
+ real transitions.
595
+
596
+ ```python
+ import numpy as np
+ from hmmlearn import hmm
+
+ # hmmlearn supports this via the 'lengths' parameter
+ lengths = [run.shape[0] for run in all_runs]
+ data_concat = np.vstack(all_runs)
+
+ # K is the number of states, chosen by the user
+ model = hmm.GaussianHMM(n_components=K, n_iter=200)
+ model.fit(data_concat, lengths=lengths)
+
+ # For prediction, also pass lengths
+ states = model.predict(data_concat, lengths=lengths)
+ ```
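Note that `prepare_ssm_input` returns run start indices, whereas hmmlearn's `lengths` argument expects the number of TRs in each run. A small conversion helper, sketched here under the hypothetical name `boundaries_to_lengths`, bridges the two.

```python
def boundaries_to_lengths(run_starts, total_trs):
    """Convert run start indices (e.g., [0, 300, 600]) to per-run lengths."""
    edges = list(run_starts) + [total_trs]
    lengths = [stop - start for start, stop in zip(edges[:-1], edges[1:])]
    if any(n <= 0 for n in lengths):
        raise ValueError("run_starts must be strictly increasing and < total_trs")
    return lengths


# Three runs of 300, 300, and 250 TRs concatenated into 850 rows
print(boundaries_to_lengths([0, 300, 600], total_trs=850))
```

The lengths should sum to the number of rows in the concatenated data matrix; hmmlearn will reject the input otherwise.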
607
+
608
+ For the `ssm` library, fit on a list of timeseries instead:
609
+ ```python
+ import ssm
+
+ D = all_runs[0].shape[1]  # number of features (ROIs or components)
+ model = ssm.HMM(K, D, observations='gaussian')
+ model.fit(all_runs)  # pass list of arrays, one per run
+ ```