opencode-skills-antigravity 1.0.40 → 1.0.41

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (84)
  1. package/bundled-skills/.antigravity-install-manifest.json +7 -1
  2. package/bundled-skills/docs/integrations/jetski-cortex.md +3 -3
  3. package/bundled-skills/docs/integrations/jetski-gemini-loader/README.md +1 -1
  4. package/bundled-skills/docs/maintainers/repo-growth-seo.md +3 -3
  5. package/bundled-skills/docs/maintainers/skills-update-guide.md +1 -1
  6. package/bundled-skills/docs/sources/sources.md +2 -2
  7. package/bundled-skills/docs/users/bundles.md +1 -1
  8. package/bundled-skills/docs/users/claude-code-skills.md +1 -1
  9. package/bundled-skills/docs/users/gemini-cli-skills.md +1 -1
  10. package/bundled-skills/docs/users/getting-started.md +1 -1
  11. package/bundled-skills/docs/users/kiro-integration.md +1 -1
  12. package/bundled-skills/docs/users/usage.md +4 -4
  13. package/bundled-skills/docs/users/visual-guide.md +4 -4
  14. package/bundled-skills/hugging-face-cli/SKILL.md +192 -195
  15. package/bundled-skills/hugging-face-community-evals/SKILL.md +213 -0
  16. package/bundled-skills/hugging-face-community-evals/examples/.env.example +3 -0
  17. package/bundled-skills/hugging-face-community-evals/examples/USAGE_EXAMPLES.md +101 -0
  18. package/bundled-skills/hugging-face-community-evals/scripts/inspect_eval_uv.py +104 -0
  19. package/bundled-skills/hugging-face-community-evals/scripts/inspect_vllm_uv.py +306 -0
  20. package/bundled-skills/hugging-face-community-evals/scripts/lighteval_vllm_uv.py +297 -0
  21. package/bundled-skills/hugging-face-dataset-viewer/SKILL.md +120 -120
  22. package/bundled-skills/hugging-face-gradio/SKILL.md +304 -0
  23. package/bundled-skills/hugging-face-gradio/examples.md +613 -0
  24. package/bundled-skills/hugging-face-jobs/SKILL.md +25 -18
  25. package/bundled-skills/hugging-face-jobs/index.html +216 -0
  26. package/bundled-skills/hugging-face-jobs/references/hardware_guide.md +336 -0
  27. package/bundled-skills/hugging-face-jobs/references/hub_saving.md +352 -0
  28. package/bundled-skills/hugging-face-jobs/references/token_usage.md +570 -0
  29. package/bundled-skills/hugging-face-jobs/references/troubleshooting.md +475 -0
  30. package/bundled-skills/hugging-face-jobs/scripts/cot-self-instruct.py +718 -0
  31. package/bundled-skills/hugging-face-jobs/scripts/finepdfs-stats.py +546 -0
  32. package/bundled-skills/hugging-face-jobs/scripts/generate-responses.py +587 -0
  33. package/bundled-skills/hugging-face-model-trainer/SKILL.md +11 -12
  34. package/bundled-skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  35. package/bundled-skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  36. package/bundled-skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  37. package/bundled-skills/hugging-face-model-trainer/references/local_training_macos.md +231 -0
  38. package/bundled-skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  39. package/bundled-skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  40. package/bundled-skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  41. package/bundled-skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  42. package/bundled-skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
  43. package/bundled-skills/hugging-face-model-trainer/references/unsloth.md +313 -0
  44. package/bundled-skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +424 -0
  45. package/bundled-skills/hugging-face-model-trainer/scripts/dataset_inspector.py +417 -0
  46. package/bundled-skills/hugging-face-model-trainer/scripts/estimate_cost.py +150 -0
  47. package/bundled-skills/hugging-face-model-trainer/scripts/train_dpo_example.py +106 -0
  48. package/bundled-skills/hugging-face-model-trainer/scripts/train_grpo_example.py +89 -0
  49. package/bundled-skills/hugging-face-model-trainer/scripts/train_sft_example.py +122 -0
  50. package/bundled-skills/hugging-face-model-trainer/scripts/unsloth_sft_example.py +512 -0
  51. package/bundled-skills/hugging-face-paper-publisher/SKILL.md +11 -4
  52. package/bundled-skills/hugging-face-paper-publisher/examples/example_usage.md +326 -0
  53. package/bundled-skills/hugging-face-paper-publisher/references/quick_reference.md +216 -0
  54. package/bundled-skills/hugging-face-paper-publisher/scripts/paper_manager.py +606 -0
  55. package/bundled-skills/hugging-face-paper-publisher/templates/arxiv.md +299 -0
  56. package/bundled-skills/hugging-face-paper-publisher/templates/ml-report.md +358 -0
  57. package/bundled-skills/hugging-face-paper-publisher/templates/modern.md +319 -0
  58. package/bundled-skills/hugging-face-paper-publisher/templates/standard.md +201 -0
  59. package/bundled-skills/hugging-face-papers/SKILL.md +241 -0
  60. package/bundled-skills/hugging-face-trackio/.claude-plugin/plugin.json +19 -0
  61. package/bundled-skills/hugging-face-trackio/SKILL.md +117 -0
  62. package/bundled-skills/hugging-face-trackio/references/alerts.md +196 -0
  63. package/bundled-skills/hugging-face-trackio/references/logging_metrics.md +206 -0
  64. package/bundled-skills/hugging-face-trackio/references/retrieving_metrics.md +251 -0
  65. package/bundled-skills/hugging-face-vision-trainer/SKILL.md +595 -0
  66. package/bundled-skills/hugging-face-vision-trainer/references/finetune_sam2_trainer.md +254 -0
  67. package/bundled-skills/hugging-face-vision-trainer/references/hub_saving.md +618 -0
  68. package/bundled-skills/hugging-face-vision-trainer/references/image_classification_training_notebook.md +279 -0
  69. package/bundled-skills/hugging-face-vision-trainer/references/object_detection_training_notebook.md +700 -0
  70. package/bundled-skills/hugging-face-vision-trainer/references/reliability_principles.md +310 -0
  71. package/bundled-skills/hugging-face-vision-trainer/references/timm_trainer.md +91 -0
  72. package/bundled-skills/hugging-face-vision-trainer/scripts/dataset_inspector.py +814 -0
  73. package/bundled-skills/hugging-face-vision-trainer/scripts/estimate_cost.py +217 -0
  74. package/bundled-skills/hugging-face-vision-trainer/scripts/image_classification_training.py +383 -0
  75. package/bundled-skills/hugging-face-vision-trainer/scripts/object_detection_training.py +710 -0
  76. package/bundled-skills/hugging-face-vision-trainer/scripts/sam_segmentation_training.py +382 -0
  77. package/bundled-skills/transformers-js/SKILL.md +639 -0
  78. package/bundled-skills/transformers-js/references/CACHE.md +339 -0
  79. package/bundled-skills/transformers-js/references/CONFIGURATION.md +390 -0
  80. package/bundled-skills/transformers-js/references/EXAMPLES.md +605 -0
  81. package/bundled-skills/transformers-js/references/MODEL_ARCHITECTURES.md +167 -0
  82. package/bundled-skills/transformers-js/references/PIPELINE_OPTIONS.md +545 -0
  83. package/bundled-skills/transformers-js/references/TEXT_GENERATION.md +315 -0
  84. package/package.json +1 -1
package/bundled-skills/hugging-face-community-evals/scripts/lighteval_vllm_uv.py
@@ -0,0 +1,297 @@
+ # /// script
+ # requires-python = ">=3.10"
+ # dependencies = [
+ #     "lighteval[accelerate,vllm]>=0.6.0",
+ #     "torch>=2.0.0",
+ #     "transformers>=4.40.0",
+ #     "accelerate>=0.30.0",
+ #     "vllm>=0.4.0",
+ # ]
+ # ///
+
+ """
+ Entry point script for running lighteval evaluations with local GPU backends.
+
+ This script runs evaluations using vLLM or accelerate on custom HuggingFace models.
+ It is separate from inference provider scripts and evaluates models directly on local hardware.
+
+ Usage (standalone):
+     uv run scripts/lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5"
+
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import os
+ import subprocess
+ import sys
+ from typing import Optional
+
+
+ def setup_environment() -> None:
+     """Configure environment variables for HuggingFace authentication."""
+     hf_token = os.getenv("HF_TOKEN")
+     if hf_token:
+         os.environ.setdefault("HUGGING_FACE_HUB_TOKEN", hf_token)
+         os.environ.setdefault("HF_HUB_TOKEN", hf_token)
+
+
+ def run_lighteval_vllm(
+     model_id: str,
+     tasks: str,
+     output_dir: Optional[str] = None,
+     max_samples: Optional[int] = None,
+     batch_size: int = 1,
+     tensor_parallel_size: int = 1,
+     gpu_memory_utilization: float = 0.8,
+     dtype: str = "auto",
+     trust_remote_code: bool = False,
+     use_chat_template: bool = False,
+     system_prompt: Optional[str] = None,
+ ) -> None:
+     """
+     Run lighteval with vLLM backend for efficient GPU inference.
+
+     Args:
+         model_id: HuggingFace model ID (e.g., "meta-llama/Llama-3.2-1B")
+         tasks: Task specification (e.g., "leaderboard|mmlu|5" or "lighteval|hellaswag|0")
+         output_dir: Directory for evaluation results
+         max_samples: Limit number of samples per task
+         batch_size: Batch size for evaluation
+         tensor_parallel_size: Number of GPUs for tensor parallelism
+         gpu_memory_utilization: GPU memory fraction to use (0.0-1.0)
+         dtype: Data type for model weights (auto, float16, bfloat16)
+         trust_remote_code: Allow executing remote code from model repo
+         use_chat_template: Apply chat template for conversational models
+         system_prompt: System prompt for chat models
+     """
+     setup_environment()
+
+     # Build lighteval vllm command
+     cmd = [
+         "lighteval",
+         "vllm",
+         model_id,
+         tasks,
+         "--batch-size", str(batch_size),
+         "--tensor-parallel-size", str(tensor_parallel_size),
+         "--gpu-memory-utilization", str(gpu_memory_utilization),
+         "--dtype", dtype,
+     ]
+
+     if output_dir:
+         cmd.extend(["--output-dir", output_dir])
+
+     if max_samples:
+         cmd.extend(["--max-samples", str(max_samples)])
+
+     if trust_remote_code:
+         cmd.append("--trust-remote-code")
+
+     if use_chat_template:
+         cmd.append("--use-chat-template")
+
+     if system_prompt:
+         cmd.extend(["--system-prompt", system_prompt])
+
+     print(f"Running: {' '.join(cmd)}")
+
+     try:
+         subprocess.run(cmd, check=True)
+         print("Evaluation complete.")
+     except subprocess.CalledProcessError as exc:
+         print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
+         sys.exit(exc.returncode)
+
+
+ def run_lighteval_accelerate(
+     model_id: str,
+     tasks: str,
+     output_dir: Optional[str] = None,
+     max_samples: Optional[int] = None,
+     batch_size: int = 1,
+     dtype: str = "bfloat16",
+     trust_remote_code: bool = False,
+     use_chat_template: bool = False,
+     system_prompt: Optional[str] = None,
+ ) -> None:
+     """
+     Run lighteval with accelerate backend for multi-GPU distributed inference.
+
+     Use this backend when vLLM is not available or for models not supported by vLLM.
+
+     Args:
+         model_id: HuggingFace model ID
+         tasks: Task specification
+         output_dir: Directory for evaluation results
+         max_samples: Limit number of samples per task
+         batch_size: Batch size for evaluation
+         dtype: Data type for model weights
+         trust_remote_code: Allow executing remote code
+         use_chat_template: Apply chat template
+         system_prompt: System prompt for chat models
+     """
+     setup_environment()
+
+     # Build lighteval accelerate command
+     cmd = [
+         "lighteval",
+         "accelerate",
+         model_id,
+         tasks,
+         "--batch-size", str(batch_size),
+         "--dtype", dtype,
+     ]
+
+     if output_dir:
+         cmd.extend(["--output-dir", output_dir])
+
+     if max_samples:
+         cmd.extend(["--max-samples", str(max_samples)])
+
+     if trust_remote_code:
+         cmd.append("--trust-remote-code")
+
+     if use_chat_template:
+         cmd.append("--use-chat-template")
+
+     if system_prompt:
+         cmd.extend(["--system-prompt", system_prompt])
+
+     print(f"Running: {' '.join(cmd)}")
+
+     try:
+         subprocess.run(cmd, check=True)
+         print("Evaluation complete.")
+     except subprocess.CalledProcessError as exc:
+         print(f"Evaluation failed with exit code {exc.returncode}", file=sys.stderr)
+         sys.exit(exc.returncode)
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(
+         description="Run lighteval evaluations with vLLM or accelerate backend on custom HuggingFace models",
+         formatter_class=argparse.RawDescriptionHelpFormatter,
+         epilog="""
+ Examples:
+     # Run MMLU evaluation with vLLM
+     uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5"
+
+     # Run with accelerate backend instead of vLLM
+     uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate
+
+     # Run with chat template for instruction-tuned models
+     uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template
+
+     # Run with limited samples for testing
+     uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10
+
+ Task format:
+     Tasks use the format: "suite|task|num_fewshot"
+     - leaderboard|mmlu|5 (MMLU with 5-shot)
+     - lighteval|hellaswag|0 (HellaSwag zero-shot)
+     - leaderboard|gsm8k|5 (GSM8K with 5-shot)
+     - Multiple tasks: "leaderboard|mmlu|5,leaderboard|gsm8k|5"
+ """,
+     )
+
+     parser.add_argument(
+         "--model",
+         required=True,
+         help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)",
+     )
+     parser.add_argument(
+         "--tasks",
+         required=True,
+         help="Task specification (e.g., 'leaderboard|mmlu|5')",
+     )
+     parser.add_argument(
+         "--backend",
+         choices=["vllm", "accelerate"],
+         default="vllm",
+         help="Inference backend to use (default: vllm)",
+     )
+     parser.add_argument(
+         "--output-dir",
+         default=None,
+         help="Directory for evaluation results",
+     )
+     parser.add_argument(
+         "--max-samples",
+         type=int,
+         default=None,
+         help="Limit number of samples per task (useful for testing)",
+     )
+     parser.add_argument(
+         "--batch-size",
+         type=int,
+         default=1,
+         help="Batch size for evaluation (default: 1)",
+     )
+     parser.add_argument(
+         "--tensor-parallel-size",
+         type=int,
+         default=1,
+         help="Number of GPUs for tensor parallelism (vLLM only, default: 1)",
+     )
+     parser.add_argument(
+         "--gpu-memory-utilization",
+         type=float,
+         default=0.8,
+         help="GPU memory fraction to use (vLLM only, default: 0.8)",
+     )
+     parser.add_argument(
+         "--dtype",
+         default="auto",
+         choices=["auto", "float16", "bfloat16", "float32"],
+         help="Data type for model weights (default: auto)",
+     )
+     parser.add_argument(
+         "--trust-remote-code",
+         action="store_true",
+         help="Allow executing remote code from model repository",
+     )
+     parser.add_argument(
+         "--use-chat-template",
+         action="store_true",
+         help="Apply chat template for instruction-tuned/chat models",
+     )
+     parser.add_argument(
+         "--system-prompt",
+         default=None,
+         help="System prompt for chat models",
+     )
+
+     args = parser.parse_args()
+
+     if args.backend == "vllm":
+         run_lighteval_vllm(
+             model_id=args.model,
+             tasks=args.tasks,
+             output_dir=args.output_dir,
+             max_samples=args.max_samples,
+             batch_size=args.batch_size,
+             tensor_parallel_size=args.tensor_parallel_size,
+             gpu_memory_utilization=args.gpu_memory_utilization,
+             dtype=args.dtype,
+             trust_remote_code=args.trust_remote_code,
+             use_chat_template=args.use_chat_template,
+             system_prompt=args.system_prompt,
+         )
+     else:
+         run_lighteval_accelerate(
+             model_id=args.model,
+             tasks=args.tasks,
+             output_dir=args.output_dir,
+             max_samples=args.max_samples,
+             batch_size=args.batch_size,
+             dtype=args.dtype if args.dtype != "auto" else "bfloat16",
+             trust_remote_code=args.trust_remote_code,
+             use_chat_template=args.use_chat_template,
+             system_prompt=args.system_prompt,
+         )
+
+
+ if __name__ == "__main__":
+     main()
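For reference, this is the command the vLLM branch assembles and prints for the docstring's usage example, with the argparse defaults spelled out (arguments are quoted here so the shell does not interpret the pipe characters; the script passes these flags through to lighteval verbatim, and whether a given lighteval release accepts them depends on its CLI):

```bash
lighteval vllm "meta-llama/Llama-3.2-1B" "leaderboard|mmlu|5" \
  --batch-size 1 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --dtype auto
```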

package/bundled-skills/hugging-face-dataset-viewer/SKILL.md
@@ -1,127 +1,127 @@
  ---
+ source: "https://github.com/huggingface/skills/tree/main/skills/huggingface-datasets"
  name: hugging-face-dataset-viewer
- description: Use this skill for Hugging Face Dataset Viewer API workflows that fetch subset/split metadata, paginate rows, search text, apply filters, download parquet URLs, and read size or statistics.
+ description: Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.
  risk: unknown
- source: community
  ---
-
+
  # Hugging Face Dataset Viewer

- Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
-
- ## Core workflow
-
- 1. Optionally validate dataset availability with `/is-valid`.
- 2. Resolve `config` + `split` with `/splits`.
- 3. Preview with `/first-rows`.
- 4. Paginate content with `/rows` using `offset` and `length` (max 100).
- 5. Use `/search` for text matching and `/filter` for row predicates.
- 6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
-
- ## Defaults
-
- - Base URL: `https://datasets-server.huggingface.co`
- - Default API method: `GET`
- - Query params should be URL-encoded.
- - `offset` is 0-based.
- - `length` max is usually `100` for row-like endpoints.
- - Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
-
- ## Dataset Viewer
-
- - `Validate dataset`: `/is-valid?dataset=<namespace/repo>`
- - `List subsets and splits`: `/splits?dataset=<namespace/repo>`
- - `Preview first rows`: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
- - `Paginate rows`: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
- - `Search text`: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
- - `Filter with predicates`: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
- - `List parquet shards`: `/parquet?dataset=<namespace/repo>`
- - `Get size totals`: `/size?dataset=<namespace/repo>`
- - `Get column statistics`: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
- - `Get Croissant metadata (if available)`: `/croissant?dataset=<namespace/repo>`
-
- Pagination pattern:
-
- ```bash
- curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
- curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
- ```
-
- When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
-
- Search/filter notes:
-
- - `/search` matches string columns (full-text style behavior is internal to the API).
- - `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
- - Keep filtering and searches read-only and side-effect free.
-
- ## Querying Datasets
-
- Use `npx parquetlens` with Hub parquet alias paths for SQL querying.
-
- Parquet alias shape:
-
- ```text
- hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
- ```
-
- Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:
-
- ```bash
- curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
-   | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
- ```
-
- Run SQL query:
-
- ```bash
- npx -y -p parquetlens -p @parquetlens/sql parquetlens \
-   "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
-   --sql "SELECT * FROM data LIMIT 20"
- ```
-
- ### SQL export
-
- - CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- - JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- - Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
-
- ## Creating and Uploading Datasets
-
- Use one of these flows depending on dependency constraints.
-
- Zero local dependencies (Hub UI):
-
- - Create dataset repo in browser: `https://huggingface.co/new-dataset`
- - Upload parquet files in the repo "Files and versions" page.
- - Verify shards appear in Dataset Viewer:
-
- ```bash
- curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
- ```
-
- Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):
-
- - Set auth token:
-
- ```bash
- export HF_TOKEN=<your_hf_token>
- ```
-
- - Upload parquet folder to a dataset repo (auto-creates repo if missing):
-
- ```bash
- npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
- ```
-
- - Upload as private repo on creation:
-
- ```bash
- npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
- ```
-
- After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
-
-
  ## When to Use
- Use this skill when tackling tasks related to its primary domain or functionality as described above.
+
+ Use this skill when you need read-only exploration of a Hugging Face dataset through the Dataset Viewer API.
+
+ Use this skill to execute read-only Dataset Viewer API calls for dataset exploration and extraction.
+
+ ## Core workflow
+
+ 1. Optionally validate dataset availability with `/is-valid`.
+ 2. Resolve `config` + `split` with `/splits`.
+ 3. Preview with `/first-rows`.
+ 4. Paginate content with `/rows` using `offset` and `length` (max 100).
+ 5. Use `/search` for text matching and `/filter` for row predicates.
+ 6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.
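As a sketch, steps 1-4 of this workflow chain together like this (dataset, config, and split borrowed from the pagination example further down):

```bash
BASE="https://datasets-server.huggingface.co"
DS="stanfordnlp/imdb"
curl -s "$BASE/is-valid?dataset=$DS"                                                # 1. validate
curl -s "$BASE/splits?dataset=$DS"                                                  # 2. resolve config + split
curl -s "$BASE/first-rows?dataset=$DS&config=plain_text&split=train"                # 3. preview
curl -s "$BASE/rows?dataset=$DS&config=plain_text&split=train&offset=0&length=100"  # 4. paginate
```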
+
+ ## Defaults
+
+ - Base URL: `https://datasets-server.huggingface.co`
+ - Default API method: `GET`
+ - Query params should be URL-encoded.
+ - `offset` is 0-based.
+ - `length` max is usually `100` for row-like endpoints.
+ - Gated/private datasets require `Authorization: Bearer <HF_TOKEN>`.
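For gated or private datasets, the only change to any call in this document is the bearer header; a minimal sketch with a placeholder repo:

```bash
# <namespace>/<gated-repo> is a placeholder; HF_TOKEN must have read access.
curl -s -H "Authorization: Bearer $HF_TOKEN" \
  "https://datasets-server.huggingface.co/splits?dataset=<namespace>/<gated-repo>"
```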
+
+ ## Dataset Viewer
+
+ - `Validate dataset`: `/is-valid?dataset=<namespace/repo>`
+ - `List subsets and splits`: `/splits?dataset=<namespace/repo>`
+ - `Preview first rows`: `/first-rows?dataset=<namespace/repo>&config=<config>&split=<split>`
+ - `Paginate rows`: `/rows?dataset=<namespace/repo>&config=<config>&split=<split>&offset=<int>&length=<int>`
+ - `Search text`: `/search?dataset=<namespace/repo>&config=<config>&split=<split>&query=<text>&offset=<int>&length=<int>`
+ - `Filter with predicates`: `/filter?dataset=<namespace/repo>&config=<config>&split=<split>&where=<predicate>&orderby=<sort>&offset=<int>&length=<int>`
+ - `List parquet shards`: `/parquet?dataset=<namespace/repo>`
+ - `Get size totals`: `/size?dataset=<namespace/repo>`
+ - `Get column statistics`: `/statistics?dataset=<namespace/repo>&config=<config>&split=<split>`
+ - `Get Croissant metadata (if available)`: `/croissant?dataset=<namespace/repo>`
+
+ Pagination pattern:
+
+ ```bash
+ curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
+ curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
+ ```
+
+ When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.
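One way to drive that continuation, assuming the `/rows` response carries `num_rows_total` alongside a `rows` array of `{row_idx, row}` objects and that `jq` is available:

```bash
BASE="https://datasets-server.huggingface.co"
Q="dataset=stanfordnlp/imdb&config=plain_text&split=train"
# Read the total once, then page through in max-size chunks of 100.
total=$(curl -s "$BASE/rows?$Q&offset=0&length=1" | jq '.num_rows_total')
for ((offset = 0; offset < total; offset += 100)); do
  curl -s "$BASE/rows?$Q&offset=$offset&length=100" | jq -c '.rows[].row'
done
```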
+
+ Search/filter notes:
+
+ - `/search` matches string columns (full-text style behavior is internal to the API).
+ - `/filter` requires predicate syntax in `where` and optional sort in `orderby`.
+ - Keep filtering and searches read-only and side-effect free.
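A sketch of a `/filter` call; the column name and SQL-style predicate are illustrative rather than confirmed syntax, and `curl -G --data-urlencode` handles the URL encoding called for in the defaults above:

```bash
curl -s -G "https://datasets-server.huggingface.co/filter" \
  --data-urlencode "dataset=stanfordnlp/imdb" \
  --data-urlencode "config=plain_text" \
  --data-urlencode "split=train" \
  --data-urlencode 'where="label" = 0' \
  --data-urlencode "offset=0" \
  --data-urlencode "length=10"
```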
+
+ ## Querying Datasets
+
+ Use `npx parquetlens` with Hub parquet alias paths for SQL querying.
+
+ Parquet alias shape:
+
+ ```text
+ hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet
+ ```
+
+ Derive `<config>`, `<split>`, and `<shard>` from Dataset Viewer `/parquet`:
+
+ ```bash
+ curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
+   | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
+ ```
+
+ Run SQL query:
+
+ ```bash
+ npx -y -p parquetlens -p @parquetlens/sql parquetlens \
+   "hf://datasets/<namespace>/<repo>@~parquet/<config>/<split>/<shard>.parquet" \
+   --sql "SELECT * FROM data LIMIT 20"
+ ```
+
+ ### SQL export
+
+ - CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
+ - JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
+ - Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`
+
+ ## Creating and Uploading Datasets
+
+ Use one of these flows depending on dependency constraints.
+
+ Zero local dependencies (Hub UI):
+
+ - Create dataset repo in browser: `https://huggingface.co/new-dataset`
+ - Upload parquet files in the repo "Files and versions" page.
+ - Verify shards appear in Dataset Viewer:
+
+ ```bash
+ curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>"
+ ```
+
+ Low dependency CLI flow (`npx @huggingface/hub` / `hfjs`):
+
+ - Set auth token:
+
+ ```bash
+ export HF_TOKEN=<your_hf_token>
+ ```
+
+ - Upload parquet folder to a dataset repo (auto-creates repo if missing):
+
+ ```bash
+ npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
+ ```
+
+ - Upload as private repo on creation:
+
+ ```bash
+ npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
+ ```
+
+ After upload, call `/parquet` to discover `<config>/<split>/<shard>` values for querying with `@~parquet`.
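Tying upload and querying together, a quick post-upload check might look like this (the repo name is a placeholder; the jq filter mirrors the one in "Querying Datasets" above):

```bash
# Print the alias for the first discovered shard of the freshly uploaded repo.
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<repo>" \
  | jq -r '.parquet_files[0] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
# Pass the printed alias to parquetlens as in "Run SQL query" above.
```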