hf2vespa 0.1.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,820 @@
1
+ Metadata-Version: 2.4
2
+ Name: hf2vespa
3
+ Version: 0.1.0
4
+ Summary: Stream HuggingFace datasets to Vespa JSON format
5
+ Author-email: Thomas Thoresen <thomas.h.thoresen@gmail.com>
6
+ License: Apache-2.0
7
+ Project-URL: Homepage, https://github.com/thomasht86/hf2vespa
8
+ Project-URL: Repository, https://github.com/thomasht86/hf2vespa
9
+ Project-URL: Issues, https://github.com/thomasht86/hf2vespa/issues
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.10
12
+ Classifier: Programming Language :: Python :: 3.11
13
+ Classifier: Programming Language :: Python :: 3.12
14
+ Classifier: Programming Language :: Python :: 3.13
15
+ Classifier: Programming Language :: Python :: 3.14
16
+ Classifier: License :: OSI Approved :: Apache Software License
17
+ Requires-Python: >=3.10
18
+ Description-Content-Type: text/markdown
19
+ Requires-Dist: typer>=0.21.0
20
+ Requires-Dist: datasets>=4.5.0
21
+ Requires-Dist: orjson>=3.11.0
22
+ Requires-Dist: pyyaml>=6.0.0
23
+ Requires-Dist: pydantic>=2.12.0
24
+ Requires-Dist: tqdm>=4.66.0
25
+ Requires-Dist: ruamel.yaml>=0.19.0
26
+
27
+ # hf2vespa
28
+
29
+ Stream HuggingFace datasets to Vespa JSON format
30
+
31
+ [![asciicast](https://asciinema.org/a/kdD2bsVNFUL51Era.svg)](https://asciinema.org/a/kdD2bsVNFUL51Era)
32
+
33
+ ## Description
34
+
35
+ A command-line tool that streams HuggingFace datasets directly to Vespa's JSON feed format without writing intermediate files or loading the entire dataset into memory. Define field mappings via YAML configuration or CLI arguments, then pipe the output directly to `vespa feed -` for efficient ingestion of millions of records.
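+
+ Conceptually, the pipeline is just a streaming read plus one JSON line per record. The sketch below is illustrative only (not the tool's internal code) and assumes the `datasets` and `orjson` packages listed in the requirements:
+
+ ```python
+ import sys
+
+ import orjson
+ from datasets import load_dataset
+
+ # Stream records lazily instead of downloading the whole dataset
+ ds = load_dataset("glue", "ax", split="test", streaming=True)
+
+ for i, row in enumerate(ds):
+     # Default namespace/doctype are both "doc"; IDs auto-increment
+     doc = {"put": f"id:doc:doc::{i}", "fields": row}
+     sys.stdout.buffer.write(orjson.dumps(doc) + b"\n")
+     if i >= 4:  # roughly what --limit 5 does
+         break
+ ```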
36
+
37
+ ## Installation
38
+
39
+ ### with uv
40
+
41
+ Run with uvx for a fast, isolated installation:
42
+
43
+ ```bash
44
+ uvx hf2vespa
45
+ ```
46
+
47
+ or install globally:
48
+
49
+ ```bash
50
+ uv tool install hf2vespa
51
+ ```
52
+
53
+ or with pip:
54
+
55
+ ```bash
56
+ pip install hf2vespa
57
+ ```
58
+
59
+ ### From Source
60
+
61
+ ```bash
62
+ git clone https://github.com/thomasht86/hf2vespa.git
63
+ cd hf2vespa
64
+ uv tool install .
65
+ ```
66
+
67
+ **Requirements:** Python 3.10+
68
+
69
+ ## Quick Start
70
+
71
+ ### Basic Usage
72
+
73
+ Stream a HuggingFace dataset to Vespa JSON format:
74
+
75
+ ```bash
76
+ hf2vespa feed glue --config ax --split test --limit 5
77
+ ```
78
+
79
+ **Output:**
80
+ ```json
81
+ {"put":"id:doc:doc::0","fields":{"premise":"The cat sat on the mat.","hypothesis":"The cat did not sit on the mat.","label":-1,"idx":0}}
82
+ {"put":"id:doc:doc::1","fields":{"premise":"The cat did not sit on the mat.","hypothesis":"The cat sat on the mat.","label":-1,"idx":1}}
83
+ {"put":"id:doc:doc::2","fields":{"premise":"When you've got no snow...","hypothesis":"When you've got snow...","label":-1,"idx":2}}
84
+ ```
85
+
86
+ ```
87
+ --- Completion Statistics ---
88
+ Total records processed: 5
89
+ Successful: 5
90
+ Errors: 0
91
+ Throughput: 2.1 records/sec
92
+ Elapsed time: 2.38s
93
+ ```
94
+
95
+ ### Preview Dataset Schema
96
+
97
+ Inspect a dataset and generate a YAML configuration template:
98
+
99
+ ```bash
100
+ hf2vespa init glue --config ax --split test --output config.yaml
101
+ ```
102
+
103
+ **Generated config.yaml:**
104
+ ```yaml
105
+ namespace: doc
106
+ doctype: doc
107
+ id_column: # null = auto-increment
108
+
109
+ mappings:
110
+ - source: premise
111
+ target: premise
112
+ type: # string
113
+ - source: hypothesis
114
+ target: hypothesis
115
+ type: # string
116
+ ```
117
+
118
+ ### Use Config File
119
+
120
+ Apply the generated configuration:
121
+
122
+ ```bash
123
+ hf2vespa feed glue --config ax --split test --config-file config.yaml --limit 5
124
+ ```
125
+
126
+ The config file defines field mappings, document IDs, and type conversions (e.g., converting lists to Vespa tensor format).
127
+
128
+ ## YAML Configuration
129
+
130
+ Configuration files define field mappings and document settings for Vespa feed generation.
131
+
132
+ ### Basic Structure
133
+
134
+ The minimal YAML configuration:
135
+
136
+ ```yaml
137
+ # Vespa document settings
138
+ namespace: doc # Namespace for document IDs
139
+ doctype: doc # Document type name
140
+ id_column: # Column to use as document ID (optional, auto-increment if omitted)
141
+
142
+ # Field mappings (optional - all columns included by default)
143
+ mappings:
144
+ - source: text # Dataset column name
145
+ target: body # Vespa field name (optional, defaults to source)
146
+ type: string # Type converter (optional)
147
+ ```
148
+
149
+ All fields are optional. If you omit `mappings`, all dataset columns are included as-is. The `target` field defaults to the `source` field name if not specified.
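+
+ As a sketch of those defaults (not hf2vespa's actual implementation, and only a small illustrative subset of converters), the mapping rules can be pictured like this:
+
+ ```python
+ # Illustrative subset; see "Field Types" below for the full list
+ CONVERTERS = {"string": str, "int": int, "float": float}
+
+ def apply_mappings(row: dict, mappings: list[dict] | None) -> dict:
+     if not mappings:                    # no mappings: include all columns as-is
+         return dict(row)
+     fields = {}
+     for m in mappings:
+         target = m.get("target") or m["source"]   # target defaults to source
+         value = row[m["source"]]
+         conv = CONVERTERS.get(m.get("type"))      # type omitted = passthrough
+         fields[target] = conv(value) if conv else value
+     return fields
+
+ apply_mappings({"text": "hello", "title": "hi"},
+                [{"source": "text", "target": "body", "type": "string"}])
+ # {'body': 'hello'}
+ ```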
150
+
151
+ ### Field Types
152
+
153
+ Type converters transform dataset values into Vespa-compatible formats.
154
+
155
+ #### Basic Types
156
+
157
+ | Type | Purpose | Example Input | Vespa Output |
158
+ |------|---------|---------------|--------------|
159
+ | `string` | Text data | `123` | `"123"` |
160
+ | `int` | Integer values | `"42"` | `42` |
161
+ | `float` | Decimal values | `"3.14"` | `3.14` |
162
+ | `tensor` | Vector embeddings | `[0.1, 0.2, 0.3]` | `{"values": [0.1, 0.2, 0.3]}` |
163
+
164
+ #### Hex-Encoded Tensors (v2.0)
165
+
166
+ Memory-efficient tensor formats using hex encoding:
167
+
168
+ | Type | Cell Type | Hex Chars/Value | Use Case |
169
+ |------|-----------|-----------------|----------|
170
+ | `tensor_int8_hex` | int8 (-128 to 127) | 2 | Quantized embeddings |
171
+ | `tensor_bfloat16_hex` | bfloat16 | 4 | ML model weights |
172
+ | `tensor_float32_hex` | float32 | 8 | Standard precision |
173
+ | `tensor_float64_hex` | float64 | 16 | High precision |
174
+
175
+ ```yaml
176
+ mappings:
177
+ - source: quantized_embedding
178
+ target: qvector
179
+ type: tensor_int8_hex # [11, 34, 3] → {"values": "0b2203"}
180
+ ```
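+
+ As a quick sanity check, the hex string in the comment above can be reproduced in plain Python (assuming two's-complement packing of each int8 value):
+
+ ```python
+ values = [11, 34, 3]
+ print(bytes(v & 0xFF for v in values).hex())  # 0b2203
+ ```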
181
+
182
+ #### Scalar Types (v2.0)
183
+
184
+ | Type | Purpose | Example Input | Vespa Output |
185
+ |------|---------|---------------|--------------|
186
+ | `position` | Geo coordinates | `{"lat": 37.4, "lng": -122.0}` | `{"lat": 37.4, "lng": -122.0}` |
187
+ | `weightedset` | Term weights | `{"tag1": 10, "tag2": 5}` | `{"tag1": 10, "tag2": 5}` |
188
+ | `map` | Key-value pairs | `{1: "one", 2: "two"}` | `{"1": "one", "2": "two"}` |
189
+
190
+ #### Sparse and Mixed Tensors (v2.0)
191
+
192
+ For advanced tensor structures like ColBERT-style multi-vector embeddings:
193
+
194
+ | Type | Purpose | Use Case |
195
+ |------|---------|----------|
196
+ | `sparse_tensor` | Single mapped dimension | Term weights, feature importance |
197
+ | `mixed_tensor` | Mapped + indexed dimensions | Multi-vector embeddings |
198
+ | `mixed_tensor_hex` | Mapped + hex-encoded indexed | Memory-efficient multi-vectors |
199
+
200
+ ```yaml
201
+ mappings:
202
+ # Sparse tensor: {"word1": 0.8, "word2": 0.5} → {"cells": [{"address": {"key": "word1"}, "value": 0.8}, ...]}
203
+ - source: term_weights
204
+ target: weights
205
+ type: sparse_tensor
206
+
207
+ # Mixed tensor: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]} → {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}
208
+ - source: token_embeddings
209
+ target: colbert
210
+ type: mixed_tensor
211
+ ```
212
+
213
+ If `type` is omitted, values are passed through as-is (no conversion).
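+
+ The sparse and mixed outputs shown in the comments above can be reproduced with two small helpers (an illustrative sketch of the target JSON forms, not the tool's code):
+
+ ```python
+ def to_sparse_tensor(weights: dict) -> dict:
+     # single mapped dimension -> Vespa "cells" notation
+     return {"cells": [{"address": {"key": k}, "value": float(v)}
+                       for k, v in weights.items()]}
+
+ def to_mixed_tensor(blocks: dict) -> dict:
+     # mapped + indexed dimensions -> Vespa "blocks" notation
+     return {"blocks": {k: [float(x) for x in vec] for k, vec in blocks.items()}}
+
+ to_sparse_tensor({"word1": 0.8, "word2": 0.5})
+ to_mixed_tensor({"w1": [1.0, 2.0], "w2": [3.0, 4.0]})
+ ```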
214
+
215
+ ### Complete Examples
216
+
217
+ #### 1. Basic text dataset (rename columns)
218
+
219
+ Simple configuration that renames columns without type conversion:
220
+
221
+ ```yaml
222
+ namespace: docs
223
+ doctype: article
224
+
225
+ mappings:
226
+ - source: text
227
+ target: body
228
+ - source: title
229
+ target: headline
230
+ ```
231
+
232
+ #### 2. Dataset with embeddings (tensor conversion)
233
+
234
+ Configuration with type conversions for embedding vectors:
235
+
236
+ ```yaml
237
+ namespace: search
238
+ doctype: document
239
+ id_column: doc_id
240
+
241
+ mappings:
242
+ - source: content
243
+ target: text
244
+ type: string
245
+ - source: embedding
246
+ target: vector
247
+ type: tensor
248
+ ```
249
+
250
+ #### 3. Generated config example
251
+
252
+ This is what the `init` command produces when you inspect a dataset schema:
253
+
254
+ ```yaml
255
+ namespace: doc
256
+ doctype: doc
257
+ id_column: # null = auto-increment
258
+
259
+ mappings:
260
+ - source: premise
261
+ target: premise
262
+ type: # string
263
+ - source: hypothesis
264
+ target: hypothesis
265
+ type: # string
266
+ - source: label
267
+ target: label
268
+ type: # int
269
+ ```
270
+
271
+ The commented type hints show the types inferred from the dataset schema. Uncomment and modify them as needed.
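+
+ One way such hints can be derived from the published dataset schema (an illustrative sketch using the `datasets` library, not necessarily how hf2vespa infers them):
+
+ ```python
+ from datasets import ClassLabel, Sequence, Value, load_dataset_builder
+
+ features = load_dataset_builder("glue", "ax").info.features
+
+ def hint(feature) -> str:
+     if isinstance(feature, Sequence):
+         return "tensor"
+     if isinstance(feature, ClassLabel):
+         return "int"
+     if isinstance(feature, Value) and feature.dtype.startswith("int"):
+         return "int"
+     if isinstance(feature, Value) and feature.dtype.startswith("float"):
+         return "float"
+     return "string"
+
+ for name, feature in features.items():
+     print(f"{name}: # {hint(feature)}")  # premise: # string ... label: # int
+ ```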
272
+
273
+ **Tip:** Use `hf2vespa init <dataset>` to generate a starter config with all fields detected from the dataset schema.
274
+
275
+ ## CLI Reference
276
+
277
+ ### `hf2vespa feed`
278
+
279
+ Stream HuggingFace dataset to Vespa JSON format.
280
+
281
+ **Usage:**
282
+ ```bash
283
+ hf2vespa feed DATASET [OPTIONS]
284
+ ```
285
+
286
+ **Arguments:**
287
+ - `DATASET` - HuggingFace dataset name (required)
288
+
289
+ **Options:**
290
+ - `--split TEXT` - Dataset split to use [default: train]
291
+ - `--config TEXT` - Dataset config name (for multi-config datasets like glue)
292
+ - `--include TEXT` - Columns to include (repeatable, e.g., `--include title --include text`)
293
+ - `--rename TEXT` - Rename columns as 'old:new' (repeatable, e.g., `--rename text:body`)
294
+ - `--namespace TEXT` - Vespa namespace for document IDs [default: doc]
295
+ - `--doctype TEXT` - Vespa document type [default: doc]
296
+ - `--config-file PATH` - YAML configuration file for field mappings
297
+ - `--limit INTEGER` - Process only first N records (useful for testing)
298
+ - `--id-column TEXT` - Dataset column to use as document ID (omit for auto-increment)
299
+ - `--on-error [fail|skip]` - Error handling mode [default: fail]
300
+ - `--num-workers INTEGER` - Number of parallel workers for dataset loading [default: CPU count]
301
+
302
+ **Examples:**
303
+
304
+ Basic streaming:
305
+ ```bash
306
+ hf2vespa feed glue --config ax
307
+ ```
308
+
309
+ Stream specific split with limit:
310
+ ```bash
311
+ hf2vespa feed glue --config ax --split test --limit 10
312
+ ```
313
+
314
+ Filter specific columns:
315
+ ```bash
316
+ hf2vespa feed glue --config ax --include premise --include hypothesis
317
+ ```
318
+
319
+ Custom namespace and doctype:
320
+ ```bash
321
+ hf2vespa feed squad --namespace wiki --doctype article
322
+ ```
323
+
324
+ Use config file for complex mappings:
325
+ ```bash
326
+ hf2vespa feed squad --config-file vespa-config.yaml
327
+ ```
328
+
329
+ Skip errors instead of failing:
330
+ ```bash
331
+ hf2vespa feed my-dataset --on-error skip
332
+ ```
333
+
334
+ ---
335
+
336
+ ### `hf2vespa init`
337
+
338
+ Generate a YAML config by inspecting a HuggingFace dataset schema.
339
+
340
+ **Usage:**
341
+ ```bash
342
+ hf2vespa init DATASET [OPTIONS]
343
+ ```
344
+
345
+ **Arguments:**
346
+ - `DATASET` - HuggingFace dataset name (required)
347
+
348
+ **Options:**
349
+ - `-o, --output PATH` - Output file path [default: vespa-config.yaml]
350
+ - `-s, --split TEXT` - Dataset split to inspect [default: train]
351
+ - `-c, --config TEXT` - Dataset config name (required for multi-config datasets)
352
+
353
+ **Examples:**
354
+
355
+ Generate config for a multi-config dataset:
356
+ ```bash
357
+ hf2vespa init glue --config ax
358
+ ```
359
+
360
+ Specify output file:
361
+ ```bash
362
+ hf2vespa init squad --output my-config.yaml
363
+ ```
364
+
365
+ Inspect a specific split:
366
+ ```bash
367
+ hf2vespa init my-dataset --split validation --output val-config.yaml
368
+ ```
369
+
370
+ ---
371
+
372
+ ### `hf2vespa install-completion`
373
+
374
+ Install shell tab-completion for hf2vespa.
375
+
376
+ **Usage:**
377
+ ```bash
378
+ hf2vespa install-completion [SHELL]
379
+ ```
380
+
381
+ **Arguments:**
382
+ - `SHELL` - Shell type (bash, zsh, fish). Auto-detected if omitted.
383
+
384
+ **Examples:**
385
+
386
+ Auto-detect shell:
387
+ ```bash
388
+ hf2vespa install-completion
389
+ ```
390
+
391
+ Explicit shell:
392
+ ```bash
393
+ hf2vespa install-completion bash
394
+ ```
395
+
396
+ After installation, restart your shell or source your shell config file (e.g., `source ~/.bashrc`).
397
+
398
+ ---
399
+
400
+ ### Backward Compatibility
401
+
402
+ For convenience, the `feed` subcommand can be omitted:
403
+
404
+ ```bash
405
+ # These are equivalent:
406
+ hf2vespa feed glue --config ax
407
+ hf2vespa glue --config ax
408
+ ```
409
+
410
+ However, we recommend using the explicit `feed` subcommand for clarity, especially in scripts.
411
+
412
+ ## Cookbook
413
+
414
+ Real-world examples using public HuggingFace datasets. All commands are copy-paste ready.
415
+
416
+ ### Example 1: Question Answering (SQuAD)
417
+
418
+ Stream Stanford Question Answering Dataset:
419
+
420
+ ```bash
421
+ # Generate config
422
+ hf2vespa init squad --output squad-config.yaml
423
+
424
+ # Preview data structure
425
+ hf2vespa feed squad --limit 3
426
+
427
+ # Full streaming with custom doctype
428
+ hf2vespa feed squad --doctype qa --namespace squad > squad-feed.jsonl
429
+ ```
430
+
431
+ **Output format:** Each record contains the `id`, `title`, `context`, `question`, and `answers` fields.
432
+
433
+ ---
434
+
435
+ ### Example 2: Text Classification (GLUE)
436
+
437
+ Stream GLUE benchmark tasks for NLU:
438
+
439
+ ```bash
440
+ # MRPC (paraphrase detection)
441
+ hf2vespa feed glue --config mrpc --limit 5
442
+
443
+ # SST-2 (sentiment analysis)
444
+ hf2vespa feed glue --config sst2 --namespace sentiment --limit 5
445
+
446
+ # With column filtering (ax only has test split)
447
+ hf2vespa feed glue --config ax --split test --include premise --include hypothesis
448
+ ```
449
+
450
+ ---
451
+
452
+ ### Example 3: Retrieval (MS MARCO)
453
+
454
+ Stream MS MARCO passage retrieval dataset:
455
+
456
+ ```bash
457
+ # Generate config to see structure
458
+ hf2vespa init ms_marco --config v1.1 --output msmarco-config.yaml
459
+
460
+ # Stream passages
461
+ hf2vespa feed ms_marco --config v1.1 --doctype passage --limit 1000
462
+ ```
463
+
464
+ ---
465
+
466
+ ### Example 4: Wikipedia
467
+
468
+ Stream Wikipedia articles:
469
+
470
+ ```bash
471
+ # Check available configs (language editions)
472
+ # Use 20220301.en for English Wikipedia snapshot
473
+
474
+ hf2vespa init wikipedia --config 20220301.simple --output wiki-config.yaml
475
+ hf2vespa feed wikipedia --config 20220301.simple --limit 100 --doctype article
476
+ ```
477
+
478
+ Note: Full Wikipedia is large. Use `--limit` for testing.
479
+
480
+ ---
481
+
482
+ ### Example 5: Custom Embeddings Dataset
483
+
484
+ For datasets with pre-computed embeddings:
485
+
486
+ ```yaml
487
+ # embedding-config.yaml
488
+ namespace: vectors
489
+ doctype: document
490
+ id_column: doc_id
491
+
492
+ mappings:
493
+ - source: text
494
+ target: content
495
+ type: string
496
+ - source: embedding
497
+ target: vector
498
+ type: tensor
499
+ ```
500
+
501
+ ```bash
502
+ hf2vespa feed your-embedding-dataset --config-file embedding-config.yaml
503
+ ```
504
+
505
+ The `tensor` type converts Python lists to Vespa tensor format: `{"values": [0.1, 0.2, ...]}`
506
+
507
+ ---
508
+
509
+ ### Example 6: Hex-Encoded Embeddings (v2.0)
510
+
511
+ For memory-efficient embedding storage, use hex-encoded tensors:
512
+
513
+ ```yaml
514
+ # hex-embedding-config.yaml
515
+ namespace: search
516
+ doctype: document
517
+ id_column: doc_id
518
+
519
+ mappings:
520
+ - source: text
521
+ target: content
522
+ type: string
523
+ # Full precision (8 hex chars per value)
524
+ - source: embedding
525
+ target: vector_f32
526
+ type: tensor_float32_hex
527
+ # Quantized (2 hex chars per value, 4x smaller)
528
+ - source: quantized_embedding
529
+ target: vector_int8
530
+ type: tensor_int8_hex
531
+ ```
532
+
533
+ ```bash
534
+ hf2vespa feed your-embedding-dataset --config-file hex-embedding-config.yaml
535
+ ```
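+
+ To see the size difference concretely, here is the per-vector arithmetic for a hypothetical 384-dimensional embedding:
+
+ ```python
+ dim = 384
+ print(dim * 8)  # 3072 hex chars per vector with tensor_float32_hex
+ print(dim * 2)  # 768 hex chars per vector with tensor_int8_hex (4x smaller)
+ ```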
536
+
537
+ ---
538
+
539
+ ### Example 7: ColBERT Multi-Vector Embeddings (v2.0)
540
+
541
+ For ColBERT-style token-level embeddings:
542
+
543
+ ```yaml
544
+ # colbert-config.yaml
545
+ namespace: colbert
546
+ doctype: passage
547
+ id_column: passage_id
548
+
549
+ mappings:
550
+ - source: text
551
+ target: content
552
+ # Token embeddings: {"token1": [0.1, 0.2, ...], "token2": [...]}
553
+ - source: token_embeddings
554
+ target: colbert_rep
555
+ type: mixed_tensor_hex # Uses float32 hex by default
556
+ ```
557
+
558
+ The `mixed_tensor_hex` type supports `cell_type` options: `int8`, `bfloat16`, `float32` (default), `float64`.
559
+
560
+ ---
561
+
562
+ ### Example 8: Geo and Weighted Data (v2.0)
563
+
564
+ For location-aware search with term weights:
565
+
566
+ ```yaml
567
+ # geo-weighted-config.yaml
568
+ namespace: places
569
+ doctype: venue
570
+
571
+ mappings:
572
+ - source: name
573
+ target: title
574
+ # Geo coordinates for geo-search
575
+ - source: coordinates
576
+ target: location
577
+ type: position # {"lat": 37.4, "lng": -122.0}
578
+ # Category weights for boosting
579
+ - source: categories
580
+ target: category_weights
581
+ type: weightedset # {"restaurant": 10, "cafe": 5}
582
+ ```
583
+
584
+ ---
585
+
586
+ ### Piping to Vespa
587
+
588
+ Stream directly to a Vespa instance:
589
+
590
+ ```bash
591
+ # Using vespa-cli
592
+ hf2vespa feed squad --limit 1000 | vespa feed -
593
+
594
+ # Or save and feed later
595
+ hf2vespa feed squad > feed.jsonl
596
+ vespa feed feed.jsonl
597
+ ```
598
+
599
+ ## Type Reference (v2.0)
600
+
601
+ Complete reference for all supported type converters.
602
+
603
+ ### Basic Types
604
+
605
+ #### `string`
606
+ Converts any value to string.
607
+ ```
608
+ Input: 123 → Output: "123"
609
+ ```
610
+
611
+ #### `int`
612
+ Converts value to integer.
613
+ ```
614
+ Input: "42" → Output: 42
615
+ ```
616
+
617
+ #### `float`
618
+ Converts value to float.
619
+ ```
620
+ Input: "3.14" → Output: 3.14
621
+ ```
622
+
623
+ #### `tensor`
624
+ Converts list to Vespa indexed tensor (JSON array format).
625
+ ```
626
+ Input: [0.1, 0.2, 0.3]
627
+ Output: {"values": [0.1, 0.2, 0.3]}
628
+ ```
629
+
630
+ ### Hex-Encoded Tensors
631
+
632
+ Memory-efficient tensor encoding for embeddings. Values are packed as binary and hex-encoded.
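+
+ The example outputs below can be reproduced with `struct` (a sketch of the packing, assuming big-endian byte order as in the examples):
+
+ ```python
+ import struct
+
+ def int8_hex(values):
+     return bytes(v & 0xFF for v in values).hex()
+
+ def bfloat16_hex(values):
+     # bfloat16 keeps the top 16 bits of the IEEE 754 float32 pattern
+     return "".join(struct.pack(">f", v)[:2].hex() for v in values)
+
+ def float32_hex(values):
+     return "".join(struct.pack(">f", v).hex() for v in values)
+
+ def float64_hex(values):
+     return "".join(struct.pack(">d", v).hex() for v in values)
+
+ int8_hex([11, 34, 3])                  # '0b2203'
+ bfloat16_hex([1.0, -1.0, 0.0])         # '3f80bf800000'
+ float32_hex([3.1415927])               # '40490fdb'
+ float64_hex([3.141592653589793])       # '400921fb54442d18'
+ ```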
633
+
634
+ #### `tensor_int8_hex`
635
+ 8-bit signed integers (-128 to 127). 2 hex chars per value.
636
+ ```
637
+ Input: [11, 34, 3]
638
+ Output: {"values": "0b2203"}
639
+ ```
640
+ **Use case:** Quantized embeddings, reduced storage (4x smaller than float32).
641
+
642
+ #### `tensor_bfloat16_hex`
643
+ Brain floating point (truncated float32). 4 hex chars per value.
644
+ ```
645
+ Input: [1.0, -1.0, 0.0]
646
+ Output: {"values": "3f80bf800000"}
647
+ ```
648
+ **Use case:** ML model weights, good range with reduced precision.
649
+
650
+ #### `tensor_float32_hex`
651
+ IEEE 754 single precision. 8 hex chars per value.
652
+ ```
653
+ Input: [3.1415927]
654
+ Output: {"values": "40490fdb"}
655
+ ```
656
+ **Use case:** Standard embedding precision.
657
+
658
+ #### `tensor_float64_hex`
659
+ IEEE 754 double precision. 16 hex chars per value.
660
+ ```
661
+ Input: [3.141592653589793]
662
+ Output: {"values": "400921fb54442d18"}
663
+ ```
664
+ **Use case:** High-precision scientific data.
665
+
666
+ ### Scalar Types
667
+
668
+ #### `position`
669
+ Geo coordinates for location-based search.
670
+ ```
671
+ Input: {"lat": 37.4, "lng": -122.0}
672
+ Output: {"lat": 37.4, "lng": -122.0}
673
+ ```
674
+ **Validation:** Latitude must be -90 to 90, longitude -180 to 180.
675
+
676
+ #### `weightedset`
677
+ Key-weight pairs for weighted search.
678
+ ```
679
+ Input: {"tag1": 10, "tag2": 5}
680
+ Output: {"tag1": 10, "tag2": 5}
681
+ ```
682
+ **Note:** Keys are stringified, weights converted to integers.
683
+
684
+ #### `map`
685
+ Generic key-value maps.
686
+ ```
687
+ Input: {1: "one", 2: "two"}
688
+ Output: {"1": "one", "2": "two"}
689
+ ```
690
+ **Note:** Keys are stringified.
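+
+ A sketch of these scalar conversions, matching the behavior described above (illustrative, not the tool's code):
+
+ ```python
+ def to_position(p: dict) -> dict:
+     lat, lng = float(p["lat"]), float(p["lng"])
+     assert -90 <= lat <= 90 and -180 <= lng <= 180  # validation noted above
+     return {"lat": lat, "lng": lng}
+
+ def to_weightedset(d: dict) -> dict:
+     return {str(k): int(v) for k, v in d.items()}   # keys stringified, int weights
+
+ def to_map(d: dict) -> dict:
+     return {str(k): v for k, v in d.items()}        # keys stringified
+
+ to_weightedset({"tag1": 10, "tag2": 5})  # {'tag1': 10, 'tag2': 5}
+ to_map({1: "one", 2: "two"})             # {'1': 'one', '2': 'two'}
+ ```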
691
+
692
+ ### Sparse and Mixed Tensors
693
+
694
+ #### `sparse_tensor`
695
+ Single mapped dimension using Vespa cells notation.
696
+ ```
697
+ Input: {"word1": 0.8, "word2": 0.5}
698
+ Output: {"cells": [{"address": {"key": "word1"}, "value": 0.8},
699
+ {"address": {"key": "word2"}, "value": 0.5}]}
700
+ ```
701
+ **Use case:** Term weights, feature importance scores.
702
+
703
+ #### `mixed_tensor`
704
+ Combined mapped + indexed dimensions using Vespa blocks notation.
705
+ ```
706
+ Input: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}
707
+ Output: {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}
708
+ ```
709
+ **Use case:** ColBERT-style multi-vector embeddings.
710
+ **Validation:** All block arrays must have the same length.
711
+
712
+ #### `mixed_tensor_hex`
713
+ Mixed tensor with hex-encoded dense dimensions.
714
+ ```
715
+ Input: {"w1": [11, 34, 3], "w2": [-124, 5, -1]} (with cell_type=int8)
716
+ Output: {"blocks": {"w1": "0b2203", "w2": "8405ff"}}
717
+ ```
718
+ **Cell types:** `int8`, `bfloat16`, `float32` (default), `float64`
719
+ **Use case:** Memory-efficient ColBERT embeddings.
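+
+ Combining the blocks notation with hex packing, the example above can be reproduced like this (sketch, `int8` cell type):
+
+ ```python
+ def to_mixed_tensor_int8_hex(blocks: dict) -> dict:
+     return {"blocks": {k: bytes(v & 0xFF for v in vec).hex()
+                        for k, vec in blocks.items()}}
+
+ to_mixed_tensor_int8_hex({"w1": [11, 34, 3], "w2": [-124, 5, -1]})
+ # {'blocks': {'w1': '0b2203', 'w2': '8405ff'}}
+ ```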
720
+
721
+ ## Troubleshooting
722
+
723
+ ### Authentication Errors
724
+
725
+ **Symptom:** `401 Unauthorized` or `403 Forbidden` when accessing private/gated datasets
726
+
727
+ **Cause:** HuggingFace authentication token not provided or invalid
728
+
729
+ **Solution:**
730
+
731
+ ```bash
732
+ # Option 1: Environment variable
733
+ export HF_TOKEN=your_token_here
734
+ hf2vespa feed your-private-dataset
735
+
736
+ # Option 2: HuggingFace CLI login (persistent)
737
+ pip install huggingface_hub
738
+ huggingface-cli login
739
+ ```
740
+
741
+ Get your token at: https://huggingface.co/settings/tokens
742
+
743
+ ---
744
+
745
+ ### Memory Issues with Large Datasets
746
+
747
+ **Symptom:** Process killed, MemoryError, or system becomes unresponsive
748
+
749
+ **Cause:** Very large individual records or very wide rows; the tool streams via the HF datasets library, so total dataset size is rarely the issue
750
+
751
+ **Solution:**
752
+
753
+ ```bash
754
+ # Use --limit to test with a manageable sample before a full run
755
+ hf2vespa feed large-dataset --limit 10000 > sample.jsonl
756
+ hf2vespa feed large-dataset > full-feed.jsonl
757
+
758
+ # Or pipe directly to Vespa (recommended)
759
+ hf2vespa feed large-dataset | vespa feed -
760
+ ```
761
+
762
+ Note: The HuggingFace datasets library handles streaming efficiently. Memory issues are rare but can occur with very wide datasets (many columns) or very large individual records.
763
+
764
+ ---
765
+
766
+ ### Type Conversion Errors
767
+
768
+ **Symptom:** `TypeError` or `ValueError` during feed generation
769
+
770
+ **Cause:** The column type doesn't match the expected converter (e.g., `tensor` applied to a non-list field)
771
+
772
+ **Solution:**
773
+
774
+ 1. Check your YAML config mappings
775
+ 2. Verify the source column type with `init`:
776
+ ```bash
777
+ hf2vespa init your-dataset --config your-config
778
+ ```
779
+ 3. Match converter type to actual data:
780
+ - Use `tensor` only for list/sequence columns (embeddings)
781
+ - Use `string`, `int`, `float` for scalar values
782
+
783
+ ---
784
+
785
+ ### Multi-Config Dataset Errors
786
+
787
+ **Symptom:** "This dataset has multiple configurations" error
788
+
789
+ **Cause:** Dataset requires a `--config` argument (like `glue`, `super_glue`, etc.)
790
+
791
+ **Solution:**
792
+
793
+ ```bash
794
+ # List available configs (check HuggingFace dataset page)
795
+ # Then specify one:
796
+ hf2vespa feed glue --config ax
797
+ hf2vespa init glue --config cola
798
+ ```
799
+
800
+ ---
801
+
802
+ ### Dataset Not Found
803
+
804
+ **Symptom:** `DatasetNotFoundError` or 404 error
805
+
806
+ **Cause:** Dataset name misspelled, private without auth, or doesn't exist
807
+
808
+ **Solution:**
809
+
810
+ 1. Verify dataset exists on HuggingFace Hub
811
+ 2. Check spelling (case-sensitive)
812
+ 3. For private datasets, ensure HF_TOKEN is set (see Authentication Errors)
813
+
814
+ ## Contributing
815
+
816
+ Issues and pull requests are welcome. Please open an issue to discuss major changes before submitting PRs.
817
+
818
+ ## License
819
+
820
+ Apache License 2.0 - see the repository for details.