@bgicli/bgicli 2.2.8 → 2.2.10

Files changed (113)
  1. package/data/skills/anthropic-algorithmic-art/SKILL.md +405 -0
  2. package/data/skills/anthropic-canvas-design/SKILL.md +130 -0
  3. package/data/skills/anthropic-claude-api/SKILL.md +243 -0
  4. package/data/skills/anthropic-doc-coauthoring/SKILL.md +375 -0
  5. package/data/skills/anthropic-docx/SKILL.md +590 -0
  6. package/data/skills/anthropic-frontend-design/SKILL.md +42 -0
  7. package/data/skills/anthropic-internal-comms/SKILL.md +32 -0
  8. package/data/skills/anthropic-mcp-builder/SKILL.md +236 -0
  9. package/data/skills/anthropic-pdf/SKILL.md +314 -0
  10. package/data/skills/anthropic-pptx/SKILL.md +232 -0
  11. package/data/skills/anthropic-skill-creator/SKILL.md +485 -0
  12. package/data/skills/anthropic-webapp-testing/SKILL.md +96 -0
  13. package/data/skills/anthropic-xlsx/SKILL.md +292 -0
  14. package/data/skills/arxiv-database/SKILL.md +362 -0
  15. package/data/skills/astropy/SKILL.md +329 -0
  16. package/data/skills/ctx-advanced-evaluation/SKILL.md +402 -0
  17. package/data/skills/ctx-bdi-mental-states/SKILL.md +311 -0
  18. package/data/skills/ctx-context-compression/SKILL.md +272 -0
  19. package/data/skills/ctx-context-degradation/SKILL.md +206 -0
  20. package/data/skills/ctx-context-fundamentals/SKILL.md +201 -0
  21. package/data/skills/ctx-context-optimization/SKILL.md +195 -0
  22. package/data/skills/ctx-evaluation/SKILL.md +251 -0
  23. package/data/skills/ctx-filesystem-context/SKILL.md +287 -0
  24. package/data/skills/ctx-hosted-agents/SKILL.md +260 -0
  25. package/data/skills/ctx-memory-systems/SKILL.md +225 -0
  26. package/data/skills/ctx-multi-agent-patterns/SKILL.md +257 -0
  27. package/data/skills/ctx-project-development/SKILL.md +291 -0
  28. package/data/skills/ctx-tool-design/SKILL.md +271 -0
  29. package/data/skills/dhdna-profiler/SKILL.md +162 -0
  30. package/data/skills/generate-image/SKILL.md +183 -0
  31. package/data/skills/geomaster/SKILL.md +365 -0
  32. package/data/skills/get-available-resources/SKILL.md +275 -0
  33. package/data/skills/hamelsmu-build-review-interface/SKILL.md +96 -0
  34. package/data/skills/hamelsmu-error-analysis/SKILL.md +164 -0
  35. package/data/skills/hamelsmu-eval-audit/SKILL.md +183 -0
  36. package/data/skills/hamelsmu-evaluate-rag/SKILL.md +177 -0
  37. package/data/skills/hamelsmu-generate-synthetic-data/SKILL.md +131 -0
  38. package/data/skills/hamelsmu-validate-evaluator/SKILL.md +212 -0
  39. package/data/skills/hamelsmu-write-judge-prompt/SKILL.md +144 -0
  40. package/data/skills/hf-cli/SKILL.md +174 -0
  41. package/data/skills/hf-mcp/SKILL.md +178 -0
  42. package/data/skills/hugging-face-dataset-viewer/SKILL.md +121 -0
  43. package/data/skills/hugging-face-datasets/SKILL.md +542 -0
  44. package/data/skills/hugging-face-evaluation/SKILL.md +651 -0
  45. package/data/skills/hugging-face-jobs/SKILL.md +1042 -0
  46. package/data/skills/hugging-face-model-trainer/SKILL.md +717 -0
  47. package/data/skills/hugging-face-paper-pages/SKILL.md +239 -0
  48. package/data/skills/hugging-face-paper-publisher/SKILL.md +624 -0
  49. package/data/skills/hugging-face-tool-builder/SKILL.md +110 -0
  50. package/data/skills/hugging-face-trackio/SKILL.md +115 -0
  51. package/data/skills/hugging-face-vision-trainer/SKILL.md +593 -0
  52. package/data/skills/huggingface-gradio/SKILL.md +245 -0
  53. package/data/skills/matlab/SKILL.md +376 -0
  54. package/data/skills/modal/SKILL.md +381 -0
  55. package/data/skills/openai-cloudflare-deploy/SKILL.md +224 -0
  56. package/data/skills/openai-develop-web-game/SKILL.md +149 -0
  57. package/data/skills/openai-doc/SKILL.md +80 -0
  58. package/data/skills/openai-figma/SKILL.md +42 -0
  59. package/data/skills/openai-figma-implement-design/SKILL.md +264 -0
  60. package/data/skills/openai-gh-address-comments/SKILL.md +25 -0
  61. package/data/skills/openai-gh-fix-ci/SKILL.md +69 -0
  62. package/data/skills/openai-imagegen/SKILL.md +174 -0
  63. package/data/skills/openai-jupyter-notebook/SKILL.md +107 -0
  64. package/data/skills/openai-linear/SKILL.md +87 -0
  65. package/data/skills/openai-netlify-deploy/SKILL.md +247 -0
  66. package/data/skills/openai-notion-knowledge-capture/SKILL.md +56 -0
  67. package/data/skills/openai-notion-meeting-intelligence/SKILL.md +60 -0
  68. package/data/skills/openai-notion-research-documentation/SKILL.md +59 -0
  69. package/data/skills/openai-notion-spec-to-implementation/SKILL.md +58 -0
  70. package/data/skills/openai-openai-docs/SKILL.md +69 -0
  71. package/data/skills/openai-pdf/SKILL.md +67 -0
  72. package/data/skills/openai-playwright/SKILL.md +147 -0
  73. package/data/skills/openai-render-deploy/SKILL.md +479 -0
  74. package/data/skills/openai-screenshot/SKILL.md +267 -0
  75. package/data/skills/openai-security-best-practices/SKILL.md +86 -0
  76. package/data/skills/openai-security-ownership-map/SKILL.md +206 -0
  77. package/data/skills/openai-security-threat-model/SKILL.md +81 -0
  78. package/data/skills/openai-sentry/SKILL.md +123 -0
  79. package/data/skills/openai-sora/SKILL.md +178 -0
  80. package/data/skills/openai-speech/SKILL.md +144 -0
  81. package/data/skills/openai-spreadsheet/SKILL.md +145 -0
  82. package/data/skills/openai-transcribe/SKILL.md +81 -0
  83. package/data/skills/openai-vercel-deploy/SKILL.md +77 -0
  84. package/data/skills/openai-yeet/SKILL.md +28 -0
  85. package/data/skills/pennylane/SKILL.md +224 -0
  86. package/data/skills/polars-bio/SKILL.md +374 -0
  87. package/data/skills/primekg/SKILL.md +97 -0
  88. package/data/skills/pymatgen/SKILL.md +689 -0
  89. package/data/skills/qiskit/SKILL.md +273 -0
  90. package/data/skills/qutip/SKILL.md +316 -0
  91. package/data/skills/recursive-decomposition/SKILL.md +185 -0
  92. package/data/skills/rowan/SKILL.md +427 -0
  93. package/data/skills/scholar-evaluation/SKILL.md +298 -0
  94. package/data/skills/sentry-create-alert/SKILL.md +210 -0
  95. package/data/skills/sentry-fix-issues/SKILL.md +126 -0
  96. package/data/skills/sentry-pr-code-review/SKILL.md +105 -0
  97. package/data/skills/sentry-python-sdk/SKILL.md +317 -0
  98. package/data/skills/sentry-setup-ai-monitoring/SKILL.md +217 -0
  99. package/data/skills/stable-baselines3/SKILL.md +297 -0
  100. package/data/skills/sympy/SKILL.md +498 -0
  101. package/data/skills/trailofbits-ask-questions-if-underspecified/SKILL.md +85 -0
  102. package/data/skills/trailofbits-audit-context-building/SKILL.md +302 -0
  103. package/data/skills/trailofbits-differential-review/SKILL.md +220 -0
  104. package/data/skills/trailofbits-insecure-defaults/SKILL.md +117 -0
  105. package/data/skills/trailofbits-modern-python/SKILL.md +333 -0
  106. package/data/skills/trailofbits-property-based-testing/SKILL.md +123 -0
  107. package/data/skills/trailofbits-semgrep-rule-creator/SKILL.md +172 -0
  108. package/data/skills/trailofbits-sharp-edges/SKILL.md +292 -0
  109. package/data/skills/trailofbits-variant-analysis/SKILL.md +142 -0
  110. package/data/skills/transformers.js/SKILL.md +637 -0
  111. package/data/skills/writing/SKILL.md +419 -0
  112. package/dist/bgi.js +66 -2
  113. package/package.json +1 -1
@@ -0,0 +1,374 @@
1
+ ---
2
+ name: polars-bio
3
+ description: High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.
4
+ license: https://github.com/biodatageeks/polars-bio/blob/main/LICENSE
5
+ metadata:
6
+ skill-author: K-Dense Inc.
7
+ ---
8
+
9
+ # polars-bio
10
+
11
+ ## Overview
12
+
13
+ polars-bio is a high-performance Python library for genomic interval operations and bioinformatics file I/O, built on Polars, Apache Arrow, and Apache DataFusion. It provides a familiar DataFrame-centric API for interval arithmetic (overlap, nearest, merge, coverage, complement, subtract) and reading/writing common bioinformatics formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ).
14
+
15
+ Key value propositions:
16
+ - **6-38x faster** than bioframe on real-world genomic benchmarks
17
+ - **Streaming/out-of-core** support for large genomes via DataFusion
18
+ - **Cloud-native** file I/O (S3, GCS, Azure) with predicate pushdown
19
+ - **Two API styles**: functional (`pb.overlap(df1, df2)`) and method-chaining (`df1.lazy().pb.overlap(df2)`)
20
+ - **SQL interface** for genomic data via DataFusion SQL engine
21
+
22
+ ## When to Use This Skill
23
+
24
+ Use this skill when:
25
+ - Performing genomic interval operations (overlap, nearest, merge, coverage, complement, subtract)
26
+ - Reading/writing bioinformatics file formats (BED, VCF, BAM, CRAM, GFF/GTF, FASTA, FASTQ)
27
+ - Processing large genomic datasets that don't fit in memory (streaming mode)
28
+ - Running SQL queries on genomic data files
29
+ - Migrating from bioframe to a faster alternative
30
+ - Computing read depth/pileup from BAM/CRAM files
31
+ - Working with Polars DataFrames containing genomic intervals
32
+
33
+ ## Quick Start
34
+
35
+ ### Installation
36
+
37
+ ```bash
38
+ pip install polars-bio
39
+ # or
40
+ uv pip install polars-bio
41
+ ```
42
+
43
+ For pandas compatibility:
44
+ ```bash
45
+ pip install polars-bio[pandas]
46
+ ```
47
+
48
+ ### Basic Overlap Example
49
+
50
+ ```python
51
+ import polars as pl
52
+ import polars_bio as pb
53
+
54
+ # Create two interval DataFrames
55
+ df1 = pl.DataFrame({
56
+ "chrom": ["chr1", "chr1", "chr1"],
57
+ "start": [1, 5, 22],
58
+ "end": [6, 9, 30],
59
+ })
60
+
61
+ df2 = pl.DataFrame({
62
+ "chrom": ["chr1", "chr1"],
63
+ "start": [3, 25],
64
+ "end": [8, 28],
65
+ })
66
+
67
+ # Functional API (returns LazyFrame by default)
68
+ result = pb.overlap(df1, df2)
69
+ result_df = result.collect()
70
+
71
+ # Get a DataFrame directly
72
+ result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
73
+
74
+ # Method-chaining API (via .pb accessor on LazyFrame)
75
+ result = df1.lazy().pb.overlap(df2)
76
+ result_df = result.collect()
77
+ ```
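To make the expected result of the example above concrete, here is a minimal pure-Python sketch of what an interval overlap join computes on the same data. It is not polars-bio's implementation: `naive_overlap` is a hypothetical helper, and it assumes closed intervals (both endpoints included), which may differ from the library's behavior under other coordinate settings.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def naive_overlap(
    left: List[Interval], right: List[Interval]
) -> List[Tuple[Interval, Interval]]:
    """Return every (left, right) pair on the same chromosome that intersects.

    Assumption: intervals are closed, so sharing a single position counts.
    """
    pairs = []
    for l_iv in left:
        for r_iv in right:
            same_chrom = l_iv[0] == r_iv[0]
            # Two closed intervals intersect when each one starts
            # no later than the other one ends.
            if same_chrom and l_iv[1] <= r_iv[2] and r_iv[1] <= l_iv[2]:
                pairs.append((l_iv, r_iv))
    return pairs


# Same rows as df1 and df2 in the example above.
df1_rows = [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]
df2_rows = [("chr1", 3, 8), ("chr1", 25, 28)]

pairs = naive_overlap(df1_rows, df2_rows)
for left_iv, right_iv in pairs:
    print(left_iv, right_iv)
```

The nested loop is quadratic and only meant for checking small inputs by hand; it yields three pairs for this data.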
78
+
79
+ ### Reading a BED File
80
+
81
+ ```python
82
+ import polars_bio as pb
83
+
84
+ # Eager read (loads entire file)
85
+ df = pb.read_bed("regions.bed")
86
+
87
+ # Lazy scan (streaming, for large files)
88
+ lf = pb.scan_bed("regions.bed")
89
+ result = lf.collect()
90
+ ```
91
+
92
+ ## Core Capabilities
93
+
94
+ ### 1. Genomic Interval Operations
95
+
96
+ polars-bio provides 8 core interval operations for genomic range arithmetic. All operations accept Polars DataFrames with `chrom`, `start`, `end` columns (configurable). All operations return a `LazyFrame` by default (use `output_type="polars.DataFrame"` for eager results).
97
+
98
+ **Operations:**
99
+ - `overlap` / `count_overlaps` - Find or count overlapping intervals between two sets
100
+ - `nearest` - Find nearest intervals (with configurable `k`, `overlap`, `distance` params)
101
+ - `merge` - Merge overlapping/bookended intervals within a set
102
+ - `cluster` - Assign cluster IDs to overlapping intervals
103
+ - `coverage` - Compute per-interval coverage counts (two-input operation)
104
+ - `complement` - Find gaps between intervals within a genome
105
+ - `subtract` - Remove portions of intervals that overlap another set
106
+
107
+ **Example:**
108
+ ```python
109
+ import polars_bio as pb
110
+
111
+ # Find overlapping intervals (returns LazyFrame)
112
+ result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
113
+
114
+ # Count overlaps per interval
115
+ counts = pb.count_overlaps(df1, df2)
116
+
117
+ # Merge overlapping intervals
118
+ merged = pb.merge(df1)
119
+
120
+ # Find nearest intervals
121
+ nearest = pb.nearest(df1, df2)
122
+
123
+ # Collect any LazyFrame result to DataFrame
124
+ result_df = result.collect()
125
+ ```
126
+
127
+ **Reference:** See `references/interval_operations.md` for detailed documentation on all operations, parameters, output schemas, and performance considerations.
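As an illustration of what `merge` does conceptually, the following pure-Python sketch sorts intervals and folds each one into its predecessor when they touch. `merge_intervals` is a hypothetical helper, not part of polars-bio, and the rule used here (merge when the next start is less than or equal to the previous end) is an assumption; how the library treats bookended intervals under each coordinate system is not shown here.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge intervals on the same chromosome that overlap or share a boundary."""
    merged: List[Interval] = []
    # Sorting tuples orders by chromosome name, then start, then end.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Extend the previous interval instead of adding a new one.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


merged_rows = merge_intervals([("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)])
print(merged_rows)  # [('chr1', 1, 9), ('chr1', 22, 30)]
```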
128
+
129
+ ### 2. Bioinformatics File I/O
130
+
131
+ Read and write common bioinformatics formats with `read_*`, `scan_*`, `write_*`, and `sink_*` functions. Supports cloud storage (S3, GCS, Azure) and compression (GZIP, BGZF).
132
+
133
+ **Supported formats:**
134
+ - **BED** - Genomic intervals (`read_bed`, `scan_bed`, `write_*` via generic)
135
+ - **VCF** - Genetic variants (`read_vcf`, `scan_vcf`, `write_vcf`, `sink_vcf`)
136
+ - **BAM** - Aligned reads (`read_bam`, `scan_bam`, `write_bam`, `sink_bam`)
137
+ - **CRAM** - Compressed alignments (`read_cram`, `scan_cram`, `write_cram`, `sink_cram`)
138
+ - **GFF** - Gene annotations (`read_gff`, `scan_gff`)
139
+ - **GTF** - Gene annotations (`read_gtf`, `scan_gtf`)
140
+ - **FASTA** - Reference sequences (`read_fasta`, `scan_fasta`)
141
+ - **FASTQ** - Sequencing reads (`read_fastq`, `scan_fastq`, `write_fastq`, `sink_fastq`)
142
+ - **SAM** - Text alignments (`read_sam`, `scan_sam`, `write_sam`, `sink_sam`)
143
+ - **Hi-C pairs** - Chromatin contacts (`read_pairs`, `scan_pairs`)
144
+
145
+ **Example:**
146
+ ```python
147
+ import polars_bio as pb
148
+
149
+ # Read VCF file
150
+ variants = pb.read_vcf("samples.vcf.gz")
151
+
152
+ # Lazy scan BAM file (streaming)
153
+ alignments = pb.scan_bam("aligned.bam")
154
+
155
+ # Read GFF annotations
156
+ genes = pb.read_gff("annotations.gff3")
157
+
158
+ # Cloud storage (individual params, not a dict)
159
+ df = pb.read_bed("s3://bucket/regions.bed",
160
+ allow_anonymous=True)
161
+ ```
162
+
163
+ **Reference:** See `references/file_io.md` for per-format column schemas, parameters, cloud storage options, and compression support.
164
+
165
+ ### 3. SQL Data Processing
166
+
167
+ Register bioinformatics files as tables and query them using DataFusion SQL. Combines the power of SQL with polars-bio's genomic-aware readers.
168
+
169
+ ```python
170
+ import polars as pl
171
+ import polars_bio as pb
172
+
173
+ # Register files as SQL tables (path first, name= keyword)
174
+ pb.register_vcf("samples.vcf.gz", name="variants")
175
+ pb.register_bed("target_regions.bed", name="regions")
176
+
177
+ # Query with SQL (returns LazyFrame)
178
+ result = pb.sql("SELECT chrom, start, end, ref, alt FROM variants WHERE qual > 30")
179
+ result_df = result.collect()
180
+
181
+ # Register a Polars DataFrame as a SQL table (df is any existing DataFrame with a chrom column)
182
+ pb.from_polars("my_intervals", df)
183
+ result = pb.sql("SELECT * FROM my_intervals WHERE chrom = 'chr1'").collect()
184
+ ```
185
+
186
+ **Reference:** See `references/sql_processing.md` for register functions, SQL syntax, and examples.
187
+
188
+ ### 4. Pileup Operations
189
+
190
+ Compute per-base read depth from BAM/CRAM files with CIGAR-aware depth calculation.
191
+
192
+ ```python
193
+ import polars_bio as pb
194
+
195
+ # Compute depth across a BAM file
196
+ depth_lf = pb.depth("aligned.bam")
197
+ depth_df = depth_lf.collect()
198
+
199
+ # With quality filter
200
+ depth_lf = pb.depth("aligned.bam", min_mapping_quality=20)
201
+ ```
202
+
203
+ **Reference:** See `references/pileup_operations.md` for parameters and integration patterns.
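To show what "CIGAR-aware" means in practice, here is a small pure-Python sketch of per-base depth. It follows the SAM specification's rules for which CIGAR operations consume the reference. It makes one assumption that may not match `pb.depth`: only `M`, `=`, and `X` add to depth, while `D` and `N` advance the reference position without counting. `depth_from_reads` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
COUNTS_AS_COVERAGE = {"M", "=", "X"}            # aligned bases
ADVANCES_REFERENCE = {"M", "=", "X", "D", "N"}  # operations that consume the reference


def depth_from_reads(reads: Iterable[Tuple[int, str]]) -> Counter:
    """Compute depth per reference position.

    reads: (1-based leftmost aligned position, CIGAR string) pairs.
    """
    depth: Counter = Counter()
    for pos, cigar in reads:
        ref_pos = pos
        for length, op in CIGAR_TOKEN.findall(cigar):
            n = int(length)
            if op in COUNTS_AS_COVERAGE:
                for p in range(ref_pos, ref_pos + n):
                    depth[p] += 1
            if op in ADVANCES_REFERENCE:
                ref_pos += n
            # I, S, H, and P neither add depth nor move along the reference.
    return depth


# Read 1 has a 2-base deletion; read 2 starts with 2 soft-clipped bases.
example_depth = depth_from_reads([(1, "3M2D2M"), (2, "2S4M")])
print(sorted(example_depth.items()))
```

A naive calculation based only on read start and read length would count the deleted positions 4 and 5 for the first read and misplace the second read; walking the CIGAR avoids both errors.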
204
+
205
+ ## Key Concepts
206
+
207
+ ### Coordinate Systems
208
+
209
+ polars-bio defaults to **1-based** coordinates (genomic convention). This can be changed globally:
210
+
211
+ ```python
212
+ import polars_bio as pb
213
+
214
+ # Switch to 0-based coordinates
215
+ pb.set_option("coordinate_system", "0-based")
216
+
217
+ # Switch back to 1-based (default)
218
+ pb.set_option("coordinate_system", "1-based")
219
+ ```
220
+
221
+ I/O functions also accept `use_zero_based` to set coordinate metadata on the resulting DataFrame:
222
+
223
+ ```python
224
+ # Read BED with explicit 0-based metadata
225
+ df = pb.read_bed("regions.bed", use_zero_based=True)
226
+ ```
227
+
228
+ **Important:** BED files are always 0-based half-open in the file format. polars-bio handles the conversion automatically when reading BED files. Coordinate metadata is attached to DataFrames by I/O functions and propagated through operations.
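The arithmetic behind the conversion is small enough to sketch directly. The helpers below are hypothetical (they are not polars-bio functions) and simply show how a 0-based half-open interval maps to a 1-based closed one and back: the start shifts by one and the end stays the same.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert 0-based half-open [start, end) to 1-based closed [start + 1, end]."""
    if start < 0 or end < start:
        raise ValueError(f"invalid 0-based interval: ({start}, {end})")
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based closed [start, end] to 0-based half-open [start - 1, end)."""
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: ({start}, {end})")
    return start - 1, end


# The first 100 bases of a chromosome in both conventions.
print(bed_to_one_based(0, 100))  # (1, 100)
print(one_based_to_bed(1, 100))  # (0, 100)
```

The interval length is preserved: `end - start` in the 0-based form equals `end - start + 1` in the 1-based form.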
229
+
230
+ ### Two API Styles
231
+
232
+ **Functional API** - standalone functions, explicit inputs:
233
+ ```python
234
+ result = pb.overlap(df1, df2, suffixes=("_1", "_2"))
235
+ merged = pb.merge(df)
236
+ ```
237
+
238
+ **Method-chaining API** - via `.pb` accessor on **LazyFrames** (not DataFrames):
239
+ ```python
240
+ result = df1.lazy().pb.overlap(df2)
241
+ merged = df.lazy().pb.merge()
242
+ ```
243
+
244
+ **Important:** The `.pb` accessor for interval operations is only available on `LazyFrame`. On `DataFrame`, `.pb` provides write operations only (`write_bam`, `write_vcf`, etc.).
245
+
246
+ Method-chaining enables fluent pipelines:
247
+ ```python
248
+ # Chain interval operations (note: overlap outputs suffixed columns,
249
+ # so rename before merge which expects chrom/start/end)
250
+ result = (
251
+ df1.lazy()
252
+ .pb.overlap(df2)
253
+ .filter(pl.col("start_2") > 1000)
254
+ .select(
255
+ pl.col("chrom_1").alias("chrom"),
256
+ pl.col("start_1").alias("start"),
257
+ pl.col("end_1").alias("end"),
258
+ )
259
+ .pb.merge()
260
+ .collect()
261
+ )
262
+ ```
263
+
264
+ ### Probe-Build Architecture
265
+
266
+ For two-input operations (overlap, nearest, count_overlaps, coverage), polars-bio uses a probe-build join strategy:
267
+ - The **first** DataFrame is the **probe** (iterated over)
268
+ - The **second** DataFrame is the **build** (indexed for lookup)
269
+
270
+ For best performance, pass the larger DataFrame as the first argument (probe) and the smaller one as the second (build).
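The probe-build idea can be sketched in plain Python. This is only an illustration under stated assumptions: the index here is a per-chromosome list sorted by start and searched with `bisect`, which is not necessarily the data structure polars-bio uses, and `build_index` and `probe` are hypothetical names. Intervals are treated as closed.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)
Index = Dict[str, Tuple[List[int], List[Tuple[int, int]]]]


def build_index(build_rows: List[Interval]) -> Index:
    """Build side: group intervals by chromosome and sort them by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end in build_rows:
        by_chrom[chrom].append((start, end))
    index: Index = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        index[chrom] = ([s for s, _ in ivs], ivs)
    return index


def probe(index: Index, probe_rows: List[Interval]) -> Iterator[Tuple[Interval, Interval]]:
    """Probe side: stream rows and look each one up in the prebuilt index."""
    for chrom, start, end in probe_rows:
        if chrom not in index:
            continue
        starts, ivs = index[chrom]
        # Only build intervals starting at or before the probe's end can overlap it.
        upper = bisect_right(starts, end)
        for b_start, b_end in ivs[:upper]:
            if b_end >= start:
                yield (chrom, start, end), (chrom, b_start, b_end)


index = build_index([("chr1", 3, 8), ("chr1", 25, 28)])
hits = list(probe(index, [("chr1", 1, 6), ("chr1", 5, 9), ("chr1", 22, 30)]))
print(hits)
```

The index is built once and held in memory while the probe side is only iterated, which is why the smaller input is the natural choice for the build side.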
271
+
272
+ ### Column Conventions
273
+
274
+ By default, polars-bio expects columns named `chrom`, `start`, `end`. Custom column names can be specified via lists:
275
+
276
+ ```python
277
+ result = pb.overlap(
278
+ df1, df2,
279
+ cols1=["chromosome", "begin", "finish"],
280
+ cols2=["chr", "pos_start", "pos_end"],
281
+ )
282
+ ```
283
+
284
+ ### Return Types and Collecting Results
285
+
286
+ All interval operations and `pb.sql()` return a **LazyFrame** by default. Use `.collect()` to materialize results, or pass `output_type="polars.DataFrame"` for eager evaluation:
287
+
288
+ ```python
289
+ # Lazy (default) - collect when needed
290
+ result_lf = pb.overlap(df1, df2)
291
+ result_df = result_lf.collect()
292
+
293
+ # Eager - get DataFrame directly
294
+ result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")
295
+ ```
296
+
297
+ ### Streaming and Out-of-Core Processing
298
+
299
+ For datasets larger than available RAM, use `scan_*` functions and streaming execution:
300
+
301
+ ```python
302
+ # Scan files lazily
303
+ lf = pb.scan_bed("large_intervals.bed")
304
+
305
+ # Process with streaming (recent Polars releases deprecate streaming=True in favor of engine="streaming")
306
+ result = lf.collect(streaming=True)
307
+ ```
308
+
309
+ DataFusion streaming is enabled by default for interval operations, processing data in batches without loading the full dataset into memory.
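Batch processing can be pictured with a generator that reads a BED stream a few rows at a time and keeps only a running aggregate. This sketch uses the standard library only and is not how DataFusion or polars-bio implement streaming; `iter_bed_batches` is a hypothetical helper and the sample rows are made up for illustration.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO, Tuple

Row = Tuple[str, int, int]  # (chrom, start, end), BED-style 0-based half-open


def iter_bed_batches(handle: TextIO, batch_size: int) -> Iterator[List[Row]]:
    """Yield lists of at most batch_size parsed BED rows."""

    def parse_rows() -> Iterator[Row]:
        for line in handle:
            # Skip blank lines and BED header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            yield chrom, int(start), int(end)

    rows = parse_rows()
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        yield batch


# Toy input standing in for a large file on disk.
sample = io.StringIO(
    "chr1\t0\t10\nchr1\t20\t35\nchr2\t5\t8\nchr2\t50\t60\nchr3\t1\t2\n"
)

batch_sizes = []
total_bp = 0
for batch in iter_bed_batches(sample, batch_size=2):
    batch_sizes.append(len(batch))
    # Only the running total survives between batches.
    total_bp += sum(end - start for _, start, end in batch)

print(batch_sizes, total_bp)
```

At any point only one batch is held in memory, so peak memory depends on `batch_size` and not on the file size.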
310
+
311
+ ## Common Pitfalls
312
+
313
+ 1. **`.pb` accessor on DataFrame vs LazyFrame:** Interval operations (overlap, merge, etc.) are only available on `LazyFrame.pb`. `DataFrame.pb` only has write methods. Call `.lazy()` before chaining interval operations.
314
+
315
+ 2. **LazyFrame returns:** All interval operations and `pb.sql()` return a `LazyFrame` by default. Call `.collect()` to materialize the result, or pass `output_type="polars.DataFrame"`.
316
+
317
+ 3. **Column name mismatches:** polars-bio expects `chrom`, `start`, `end` by default. Use `cols1`/`cols2` parameters (as lists) if your columns have different names.
318
+
319
+ 4. **Coordinate system metadata:** When constructing DataFrames manually (not via `read_*`/`scan_*`), polars-bio warns about missing coordinate metadata. Use `pb.set_option("coordinate_system", "0-based")` globally, or use I/O functions that set metadata automatically.
320
+
321
+ 5. **Probe-build order matters:** For overlap, nearest, count_overlaps, and coverage, the first DataFrame is probed against the second. Swapping arguments changes which intervals appear in the left vs right output columns, and can affect performance.
322
+
323
+ 6. **INT32 position limit:** Genomic positions are stored as 32-bit integers, limiting coordinates to ~2.1 billion. This is sufficient for all known genomes but may be an issue with custom coordinate spaces.
324
+
325
+ 7. **BAM index requirements:** `read_bam` and `scan_bam` require a `.bai` index file alongside the BAM. Create one with `samtools index` if missing.
326
+
327
+ 8. **Parallel execution disabled by default:** DataFusion parallelism defaults to 1 partition. Enable for large datasets:
328
+ ```python
329
+ pb.set_option("datafusion.execution.target_partitions", 8)
330
+ ```
331
+
332
+ 9. **CRAM has separate functions:** Use `read_cram`/`scan_cram`/`register_cram` for CRAM files (not `read_bam`). CRAM functions require a `reference_path` parameter.
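For pitfall 6, a quick pre-flight check can catch coordinates that would not fit in a signed 32-bit integer before they reach the library. This is a minimal sketch with a hypothetical helper name. The example values are the length of human chr1 in GRCh38 and a position in a made-up concatenated coordinate space.

```python
from typing import Iterable

INT32_MAX = 2**31 - 1  # 2_147_483_647, the largest signed 32-bit integer


def positions_fit_int32(positions: Iterable[int]) -> bool:
    """Return True when every position is non-negative and fits in int32."""
    return all(0 <= p <= INT32_MAX for p in positions)


# A single human chromosome is far below the limit.
print(positions_fit_int32([1, 248_956_422]))  # True

# A custom space that lays chromosomes end to end can exceed it.
print(positions_fit_int32([3_000_000_000]))  # False
```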
333
+
334
+ ## Best Practices
335
+
336
+ 1. **Use `scan_*` for large files:** Prefer `scan_bed`, `scan_vcf`, etc. over `read_*` for files larger than available RAM. Scan functions enable streaming and predicate pushdown.
337
+
338
+ 2. **Configure parallelism for large datasets:**
339
+ ```python
340
+ import os
341
+ pb.set_option("datafusion.execution.target_partitions", os.cpu_count())
342
+ ```
343
+
344
+ 3. **Use BGZF compression:** BGZF-compressed files (`.bed.gz`, `.vcf.gz`) support parallel block decompression, significantly faster than plain GZIP.
345
+
346
+ 4. **Select columns early:** When only specific columns are needed, select them early to reduce memory usage:
347
+ ```python
348
+ df = pb.read_vcf("large.vcf.gz").select("chrom", "start", "end", "ref", "alt")
349
+ ```
350
+
351
+ 5. **Use cloud paths directly:** Pass S3/GCS/Azure URIs directly to read/scan functions instead of downloading files first:
352
+ ```python
353
+ df = pb.read_bed("s3://my-bucket/regions.bed", allow_anonymous=True)
354
+ ```
355
+
356
+ 6. **Prefer functional API for single operations, method-chaining for pipelines:** Use `pb.overlap()` for one-off operations and `.lazy().pb.overlap()` when building multi-step pipelines.
357
+
358
+ ## Resources
359
+
360
+ ### references/
361
+
362
+ Detailed documentation for each major capability:
363
+
364
+ - **interval_operations.md** - All 8 interval operations with parameters, examples, output schemas, and performance tips. Core reference for genomic range arithmetic.
365
+
366
+ - **file_io.md** - Supported formats table, per-format column schemas, cloud storage configuration, compression support, and common parameters.
367
+
368
+ - **sql_processing.md** - Register functions, DataFusion SQL syntax, combining SQL with interval operations, and example queries.
369
+
370
+ - **pileup_operations.md** - Per-base read depth computation from BAM/CRAM files, parameters, and integration with interval operations.
371
+
372
+ - **configuration.md** - Global settings (parallelism, coordinate systems, streaming modes), logging, and metadata management.
373
+
374
+ - **bioframe_migration.md** - Operation mapping table, API differences, performance comparison, migration code examples, and pandas compatibility mode.
@@ -0,0 +1,97 @@
1
+ ---
2
+ name: primekg
3
+ description: Query the Precision Medicine Knowledge Graph (PrimeKG) for multiscale biological data including genes, drugs, diseases, phenotypes, and more.
4
+ license: Unknown
5
+ metadata:
6
+ skill-author: K-Dense Inc. (PrimeKG original from Harvard MIMS)
7
+ ---
8
+
9
+ # PrimeKG Knowledge Graph Skill
10
+
11
+ ## Overview
12
+
13
+ PrimeKG is a precision medicine knowledge graph that integrates over 20 primary databases and high-quality scientific literature into a single resource. It contains roughly 129,000 nodes and about 4 million edges across 29 relationship types, including drug-target, disease-gene, and phenotype-disease associations.
14
+
15
+ **Key capabilities:**
16
+ - Search for nodes (genes, proteins, drugs, diseases, phenotypes)
17
+ - Retrieve direct neighbors (associated entities and clinical evidence)
18
+ - Analyze local disease context (related genes, drugs, phenotypes)
19
+ - Identify drug-disease paths (potential repurposing opportunities)
20
+
21
+ **Data access:** Programmatic access via `scripts/query_primekg.py`. Data is stored at `C:\Users\eamon\Documents\Data\PrimeKG\kg.csv`.
22
+
23
+ ## When to Use This Skill
24
+
25
+ This skill should be used when:
26
+
27
+ - **Knowledge-based drug discovery:** Identifying targets and mechanisms for diseases.
28
+ - **Drug repurposing:** Finding existing drugs with evidence that supports new indications.
29
+ - **Phenotype analysis:** Understanding how symptoms/phenotypes relate to diseases and genes.
30
+ - **Multiscale biology:** Bridging the gap between molecular targets (genes) and clinical outcomes (diseases).
31
+ - **Network pharmacology:** Investigating the broader network effects of drug-target interactions.
32
+
33
+ ## Core Workflow
34
+
35
+ ### 1. Search for Entities
36
+
37
+ Find identifiers for genes, drugs, or diseases.
38
+
39
+ ```python
40
+ from scripts.query_primekg import search_nodes
41
+
42
+ # Search for Alzheimer's disease nodes
43
+ results = search_nodes("Alzheimer", node_type="disease")
44
+ # Returns: [{"id": "EFO_0000249", "type": "disease", "name": "Alzheimer's disease", ...}]
45
+ ```
46
+
47
+ ### 2. Get Neighbors (Direct Associations)
48
+
49
+ Retrieve all connected nodes and relationship types.
50
+
51
+ ```python
52
+ from scripts.query_primekg import get_neighbors
53
+
54
+ # Get all neighbors of a specific disease ID
55
+ neighbors = get_neighbors("EFO_0000249")
56
+ # Returns: List of neighbors like {"neighbor_name": "APOE", "relation": "disease_gene", ...}
57
+ ```
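The lookup that a neighbor query performs can be sketched on a tiny in-memory edge table. Everything here is an assumption for illustration: the column names (`relation`, `x_id`, `x_name`, `y_id`, `y_name`), the helper `neighbors_of`, and the toy rows with placeholder IDs are not taken from `query_primekg.py` or from the real `kg.csv`. Edges are treated as undirected, so a node matches on either side.

```python
import csv
import io
from typing import Dict, List, Optional

# Toy edge table; placeholder IDs and names, not real PrimeKG rows.
SAMPLE_KG_CSV = """relation,x_id,x_name,y_id,y_name
disease_gene,EFO_0000249,Alzheimer's disease,GENE_1,APOE
drug_disease,DRUG_A,drug A,EFO_0000249,Alzheimer's disease
disease_phenotype,EFO_0000249,Alzheimer's disease,PHENO_1,phenotype 1
drug_protein,DRUG_A,drug A,GENE_1,APOE
"""


def neighbors_of(
    rows: List[Dict[str, str]], node_id: str, relation_type: Optional[str] = None
) -> List[Dict[str, str]]:
    """Return neighbors of node_id, optionally restricted to one relation type."""
    found = []
    for row in rows:
        if relation_type is not None and row["relation"] != relation_type:
            continue
        if row["x_id"] == node_id:
            other_id, other_name = row["y_id"], row["y_name"]
        elif row["y_id"] == node_id:
            other_id, other_name = row["x_id"], row["x_name"]
        else:
            continue
        found.append(
            {"neighbor_id": other_id, "neighbor_name": other_name, "relation": row["relation"]}
        )
    return found


rows = list(csv.DictReader(io.StringIO(SAMPLE_KG_CSV)))
all_neighbors = neighbors_of(rows, "EFO_0000249")
gene_neighbors = neighbors_of(rows, "EFO_0000249", relation_type="disease_gene")
print(len(all_neighbors), gene_neighbors)
```

On the full graph a linear scan like this would be slow; the sketch only shows the matching logic, including why both edge endpoints need to be checked.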
58
+
59
+ ### 3. Analyze Disease Context
60
+
61
+ Use the high-level `get_disease_context` function to summarize the associations of a disease.
62
+
63
+ ```python
64
+ from scripts.query_primekg import get_disease_context
65
+
66
+ # Comprehensive summary for a disease
67
+ context = get_disease_context("Alzheimer's disease")
68
+ # Access: context['associated_genes'], context['associated_drugs'], context['phenotypes']
69
+ ```
70
+
71
+ ## Relationship Types in PrimeKG
72
+
73
+ The graph contains several key relationship types including:
74
+ - `protein_protein`: Physical protein-protein interactions (PPIs)
75
+ - `drug_protein`: Drug target/mechanism associations
76
+ - `disease_gene`: Genetic associations
77
+ - `drug_disease`: Indications and contraindications
78
+ - `disease_phenotype`: Clinical signs and symptoms
79
+ - `gwas`: Evidence from genome-wide association studies (GWAS)
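Relationship types like those listed above can be chained to find drug-disease paths, for example a drug linked to a protein that is in turn linked to a disease. The sketch below runs a breadth-first search over a toy undirected edge list. The node names are placeholders and `shortest_path` is a hypothetical helper, not a function from `query_primekg.py`.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Set, Tuple


def shortest_path(
    edges: List[Tuple[str, str]], source: str, target: str
) -> Optional[List[str]]:
    """Return one shortest path between two nodes, or None if they are not connected."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for a, b in edges:
        # Treat every edge as undirected.
        graph[a].add(b)
        graph[b].add(a)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Sorted iteration keeps the result deterministic.
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Placeholder graph: a drug, three genes/proteins, and a disease.
toy_edges = [
    ("DRUG_A", "GENE_X"),
    ("GENE_X", "DISEASE_Y"),
    ("DRUG_A", "GENE_Z"),
    ("GENE_Z", "GENE_X"),
    ("GENE_Q", "GENE_Q2"),
]

path = shortest_path(toy_edges, "DRUG_A", "DISEASE_Y")
print(path)  # ['DRUG_A', 'GENE_X', 'DISEASE_Y']
```

Breadth-first search returns a path with the fewest hops. Such a path is only a candidate for repurposing and says nothing about the strength of the evidence on each edge.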
80
+
81
+ ## Best Practices
82
+
83
+ 1. **Use specific IDs:** When using `get_neighbors`, ensure you have the correct ID from `search_nodes`.
84
+ 2. **Context first:** Use `get_disease_context` for a broad overview before diving into specific genes or drugs.
85
+ 3. **Filter relationships:** Use the `relation_type` filter in `get_neighbors` to focus on specific evidence (e.g., only `drug_protein`).
86
+ 4. **Multiscale integration:** Combine with `OpenTargets` for deeper genetic evidence or `Semantic Scholar` for the latest literature context.
87
+
88
+ ## Resources
89
+
90
+ ### Scripts
91
+ - `scripts/query_primekg.py`: Core functions for searching and querying the knowledge graph.
92
+
93
+ ### Data Path
94
+ - Data: `/mnt/c/Users/eamon/Documents/Data/PrimeKG/kg.csv` (the WSL mount of the Windows path given under Data access)
95
+ - Total nodes: ~129,000
96
+ - Total edges: ~4,000,000
97
+ - Database: CSV-based, optimized for pandas querying.