hf2vespa 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- hf2vespa/__init__.py +1 -0
- hf2vespa/__main__.py +4 -0
- hf2vespa/cli.py +465 -0
- hf2vespa/config.py +131 -0
- hf2vespa/converters.py +648 -0
- hf2vespa/init.py +351 -0
- hf2vespa/pipeline.py +198 -0
- hf2vespa/stats.py +76 -0
- hf2vespa/utils.py +57 -0
- hf2vespa-0.1.0.dist-info/METADATA +820 -0
- hf2vespa-0.1.0.dist-info/RECORD +14 -0
- hf2vespa-0.1.0.dist-info/WHEEL +5 -0
- hf2vespa-0.1.0.dist-info/entry_points.txt +2 -0
- hf2vespa-0.1.0.dist-info/top_level.txt +1 -0
@@ -0,0 +1,820 @@
Metadata-Version: 2.4
Name: hf2vespa
Version: 0.1.0
Summary: Stream HuggingFace datasets to Vespa JSON format
Author-email: Thomas Thoresen <thomas.h.thoresen@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/thomasht86/hf2vespa
Project-URL: Repository, https://github.com/thomasht86/hf2vespa
Project-URL: Issues, https://github.com/thomasht86/hf2vespa/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: typer>=0.21.0
Requires-Dist: datasets>=4.5.0
Requires-Dist: orjson>=3.11.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: pydantic>=2.12.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: ruamel.yaml>=0.19.0

# hf2vespa

Stream HuggingFace datasets to Vespa JSON format

[Demo recording (asciinema)](https://asciinema.org/a/kdD2bsVNFUL51Era)

## Description

A command-line tool for streaming HuggingFace datasets directly to Vespa's JSON feed format without intermediate files or loading entire datasets into memory. Define field mappings via YAML configuration or CLI arguments, then pipe output directly to `vespa feed -` for efficient ingestion of millions of records.

## Installation

### With uv

Run with uvx for a fast, isolated installation:

```bash
uvx hf2vespa
```

or install globally:

```bash
uv tool install hf2vespa
```

or with pip:

```bash
pip install hf2vespa
```

### From Source

```bash
git clone https://github.com/thomasht86/hf2vespa.git
cd hf2vespa
uv tool install .
```

**Requirements:** Python 3.10+

## Quick Start

### Basic Usage

Stream a HuggingFace dataset to Vespa JSON format:

```bash
hf2vespa feed glue --config ax --split test --limit 5
```

**Output:**
```json
{"put":"id:doc:doc::0","fields":{"premise":"The cat sat on the mat.","hypothesis":"The cat did not sit on the mat.","label":-1,"idx":0}}
{"put":"id:doc:doc::1","fields":{"premise":"The cat did not sit on the mat.","hypothesis":"The cat sat on the mat.","label":-1,"idx":1}}
{"put":"id:doc:doc::2","fields":{"premise":"When you've got no snow...","hypothesis":"When you've got snow...","label":-1,"idx":2}}
```

```
--- Completion Statistics ---
Total records processed: 5
Successful: 5
Errors: 0
Throughput: 2.1 records/sec
Elapsed time: 2.38s
```
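
Each output line is a standalone Vespa put operation: the document ID follows the `id:<namespace>:<doctype>::<id>` pattern and the mapped columns become `fields`. A minimal sketch of how one dataset record could be turned into such a line (illustrative only; `build_put_line` is a hypothetical helper, not part of the hf2vespa API):

```python
import orjson

def build_put_line(record: dict, namespace: str = "doc", doctype: str = "doc",
                   doc_id: str | int = 0) -> bytes:
    """Serialize one dataset record as a Vespa put operation (one JSON line)."""
    return orjson.dumps({
        "put": f"id:{namespace}:{doctype}::{doc_id}",
        "fields": record,
    })

# Example: the first GLUE "ax" record from the Quick Start output above
print(build_put_line({"premise": "The cat sat on the mat.",
                      "hypothesis": "The cat did not sit on the mat.",
                      "label": -1, "idx": 0}).decode())
```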

### Preview Dataset Schema

Inspect a dataset and generate a YAML configuration template:

```bash
hf2vespa init glue --config ax --split test --output config.yaml
```

**Generated config.yaml:**
```yaml
namespace: doc
doctype: doc
id_column: # null = auto-increment

mappings:
  - source: premise
    target: premise
    type: # string
  - source: hypothesis
    target: hypothesis
    type: # string
```

### Use Config File

Apply the generated configuration:

```bash
hf2vespa feed glue --config ax --split test --config-file config.yaml --limit 5
```

The config file defines field mappings, document IDs, and type conversions (e.g., converting lists to Vespa tensor format).

## YAML Configuration

Configuration files define field mappings and document settings for Vespa feed generation.

### Basic Structure

The minimal YAML configuration:

```yaml
# Vespa document settings
namespace: doc   # Namespace for document IDs
doctype: doc     # Document type name
id_column:       # Column to use as document ID (optional, auto-increment if omitted)

# Field mappings (optional - all columns included by default)
mappings:
  - source: text     # Dataset column name
    target: body     # Vespa field name (optional, defaults to source)
    type: string     # Type converter (optional)
```

All fields are optional. If you omit `mappings`, all dataset columns are included as-is. The `target` field defaults to the `source` field name if not specified.
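
To make the mapping semantics concrete, here is a rough sketch of how a single record could be transformed according to such a config (hypothetical, simplified logic; not the actual hf2vespa internals):

```python
# Sketch: apply a config's mappings to one dataset record (assumed, simplified behaviour).
CONVERTERS = {
    "string": str,
    "int": int,
    "float": float,
    "tensor": lambda v: {"values": list(v)},
}

def apply_mappings(record: dict, mappings: list[dict] | None) -> dict:
    if not mappings:                       # no mappings: pass all columns through as-is
        return dict(record)
    fields = {}
    for m in mappings:
        value = record[m["source"]]
        if m.get("type"):                  # optional type converter
            value = CONVERTERS[m["type"]](value)
        fields[m.get("target") or m["source"]] = value
    return fields

print(apply_mappings({"text": "hello", "title": "Hi"},
                     [{"source": "text", "target": "body", "type": "string"}]))
# -> {'body': 'hello'}
```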

### Field Types

Type converters transform dataset values into Vespa-compatible formats.

#### Basic Types

| Type | Purpose | Example Input | Vespa Output |
|------|---------|---------------|--------------|
| `string` | Text data | `123` | `"123"` |
| `int` | Integer values | `"42"` | `42` |
| `float` | Decimal values | `"3.14"` | `3.14` |
| `tensor` | Vector embeddings | `[0.1, 0.2, 0.3]` | `{"values": [0.1, 0.2, 0.3]}` |

#### Hex-Encoded Tensors (v2.0)

Memory-efficient tensor formats using hex encoding:

| Type | Cell Type | Hex Chars/Value | Use Case |
|------|-----------|-----------------|----------|
| `tensor_int8_hex` | int8 (-128 to 127) | 2 | Quantized embeddings |
| `tensor_bfloat16_hex` | bfloat16 | 4 | ML model weights |
| `tensor_float32_hex` | float32 | 8 | Standard precision |
| `tensor_float64_hex` | float64 | 16 | High precision |

```yaml
mappings:
  - source: quantized_embedding
    target: qvector
    type: tensor_int8_hex   # [11, 34, 3] → {"values": "0b2203"}
```
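
The hex string is simply the tensor's binary cell values concatenated. A small sketch of how the int8 and float32 encodings shown above could be produced with the standard library (illustrative only, not the hf2vespa converter code):

```python
import struct

def int8_hex(values):
    """Pack each value as a signed byte -> 2 hex chars per value."""
    return struct.pack(f">{len(values)}b", *values).hex()

def float32_hex(values):
    """Pack each value as big-endian IEEE 754 float32 -> 8 hex chars per value."""
    return struct.pack(f">{len(values)}f", *values).hex()

print(int8_hex([11, 34, 3]))       # -> 0b2203
print(float32_hex([1.0, -1.0]))    # -> 3f800000bf800000
```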

#### Scalar Types (v2.0)

| Type | Purpose | Example Input | Vespa Output |
|------|---------|---------------|--------------|
| `position` | Geo coordinates | `{"lat": 37.4, "lng": -122.0}` | `{"lat": 37.4, "lng": -122.0}` |
| `weightedset` | Term weights | `{"tag1": 10, "tag2": 5}` | `{"tag1": 10, "tag2": 5}` |
| `map` | Key-value pairs | `{1: "one", 2: "two"}` | `{"1": "one", "2": "two"}` |

#### Sparse and Mixed Tensors (v2.0)

For advanced tensor structures like ColBERT-style multi-vector embeddings:

| Type | Purpose | Use Case |
|------|---------|----------|
| `sparse_tensor` | Single mapped dimension | Term weights, feature importance |
| `mixed_tensor` | Mapped + indexed dimensions | Multi-vector embeddings |
| `mixed_tensor_hex` | Mapped + hex-encoded indexed | Memory-efficient multi-vectors |

```yaml
mappings:
  # Sparse tensor: {"word1": 0.8, "word2": 0.5} → {"cells": [{"address": {"key": "word1"}, "value": 0.8}, ...]}
  - source: term_weights
    target: weights
    type: sparse_tensor

  # Mixed tensor: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]} → {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}
  - source: token_embeddings
    target: colbert
    type: mixed_tensor
```
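
A rough sketch of how a plain dict could be turned into these two Vespa tensor notations, following the examples above (simplified illustration, not the library's converter):

```python
def to_sparse_tensor(weights: dict) -> dict:
    """{"word1": 0.8, ...} -> Vespa cells notation for a single mapped dimension."""
    return {"cells": [{"address": {"key": str(k)}, "value": float(v)}
                      for k, v in weights.items()]}

def to_mixed_tensor(blocks: dict) -> dict:
    """{"w1": [1.0, 2.0], ...} -> Vespa blocks notation (mapped + indexed dimension)."""
    lengths = {len(v) for v in blocks.values()}
    if len(lengths) > 1:
        raise ValueError("all block arrays must have the same length")
    return {"blocks": {str(k): list(v) for k, v in blocks.items()}}

print(to_sparse_tensor({"word1": 0.8, "word2": 0.5}))
print(to_mixed_tensor({"w1": [1.0, 2.0], "w2": [3.0, 4.0]}))
```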

If `type` is omitted, values are passed through as-is (no conversion).

### Complete Examples

#### 1. Basic text dataset (rename columns)

Simple configuration that renames columns without type conversion:

```yaml
namespace: docs
doctype: article

mappings:
  - source: text
    target: body
  - source: title
    target: headline
```

#### 2. Dataset with embeddings (tensor conversion)

Configuration with type conversions for embedding vectors:

```yaml
namespace: search
doctype: document
id_column: doc_id

mappings:
  - source: content
    target: text
    type: string
  - source: embedding
    target: vector
    type: tensor
```

#### 3. Generated config example

This is what the `init` command produces when you inspect a dataset schema:

```yaml
namespace: doc
doctype: doc
id_column: # null = auto-increment

mappings:
  - source: premise
    target: premise
    type: # string
  - source: hypothesis
    target: hypothesis
    type: # string
  - source: label
    target: label
    type: # int
```

The commented type hints show inferred types based on the dataset schema. Uncomment and modify as needed.

**Tip:** Use `hf2vespa init <dataset>` to generate a starter config with all fields detected from the dataset schema.

## CLI Reference

### `hf2vespa feed`

Stream a HuggingFace dataset to Vespa JSON format.

**Usage:**
```bash
hf2vespa feed DATASET [OPTIONS]
```

**Arguments:**
- `DATASET` - HuggingFace dataset name (required)

**Options:**
- `--split TEXT` - Dataset split to use [default: train]
- `--config TEXT` - Dataset config name (for multi-config datasets like glue)
- `--include TEXT` - Columns to include (repeatable, e.g., `--include title --include text`)
- `--rename TEXT` - Rename columns as 'old:new' (repeatable, e.g., `--rename text:body`)
- `--namespace TEXT` - Vespa namespace for document IDs [default: doc]
- `--doctype TEXT` - Vespa document type [default: doc]
- `--config-file PATH` - YAML configuration file for field mappings
- `--limit INTEGER` - Process only the first N records (useful for testing)
- `--id-column TEXT` - Dataset column to use as document ID (omit for auto-increment)
- `--on-error [fail|skip]` - Error handling mode [default: fail]
- `--num-workers INTEGER` - Number of parallel workers for dataset loading [default: CPU count]

**Examples:**

Basic streaming:
```bash
hf2vespa feed glue --config ax
```

Stream a specific split with a limit:
```bash
hf2vespa feed glue --config ax --split test --limit 10
```

Filter specific columns:
```bash
hf2vespa feed glue --config ax --include premise --include hypothesis
```

Custom namespace and doctype:
```bash
hf2vespa feed squad --namespace wiki --doctype article
```

Use a config file for complex mappings:
```bash
hf2vespa feed squad --config-file vespa-config.yaml
```

Skip errors instead of failing:
```bash
hf2vespa feed my-dataset --on-error skip
```

---

### `hf2vespa init`

Generate a YAML config by inspecting a HuggingFace dataset schema.

**Usage:**
```bash
hf2vespa init DATASET [OPTIONS]
```

**Arguments:**
- `DATASET` - HuggingFace dataset name (required)

**Options:**
- `-o, --output PATH` - Output file path [default: vespa-config.yaml]
- `-s, --split TEXT` - Dataset split to inspect [default: train]
- `-c, --config TEXT` - Dataset config name (required for multi-config datasets)

**Examples:**

Generate a config for a multi-config dataset:
```bash
hf2vespa init glue --config ax
```

Specify the output file:
```bash
hf2vespa init squad --output my-config.yaml
```

Inspect a specific split:
```bash
hf2vespa init my-dataset --split validation --output val-config.yaml
```

---

### `hf2vespa install-completion`

Install shell tab-completion for hf2vespa.

**Usage:**
```bash
hf2vespa install-completion [SHELL]
```

**Arguments:**
- `SHELL` - Shell type (bash, zsh, fish). Auto-detected if omitted.

**Examples:**

Auto-detect shell:
```bash
hf2vespa install-completion
```

Explicit shell:
```bash
hf2vespa install-completion bash
```

After installation, restart your shell or source your shell config file (e.g., `source ~/.bashrc`).

---

### Backward Compatibility

For convenience, the `feed` subcommand can be omitted:

```bash
# These are equivalent:
hf2vespa feed glue --config ax
hf2vespa glue --config ax
```

However, we recommend using the explicit `feed` subcommand for clarity, especially in scripts.

## Cookbook

Real-world examples using public HuggingFace datasets. All commands are copy-paste ready.

### Example 1: Question Answering (SQuAD)

Stream the Stanford Question Answering Dataset:

```bash
# Generate config
hf2vespa init squad --output squad-config.yaml

# Preview data structure
hf2vespa feed squad --limit 3

# Full streaming with custom doctype
hf2vespa feed squad --doctype qa --namespace squad > squad-feed.jsonl
```

**Output format:** Each record contains `id`, `title`, `context`, `question`, `answers` fields.

---

### Example 2: Text Classification (GLUE)

Stream GLUE benchmark tasks for NLU:

```bash
# MRPC (paraphrase detection)
hf2vespa feed glue --config mrpc --limit 5

# SST-2 (sentiment analysis)
hf2vespa feed glue --config sst2 --namespace sentiment --limit 5

# With column filtering (ax only has a test split)
hf2vespa feed glue --config ax --split test --include premise --include hypothesis
```

---

### Example 3: Retrieval (MS MARCO)

Stream the MS MARCO passage retrieval dataset:

```bash
# Generate config to see structure
hf2vespa init ms_marco --config v1.1 --output msmarco-config.yaml

# Stream passages
hf2vespa feed ms_marco --config v1.1 --doctype passage --limit 1000
```

---

### Example 4: Wikipedia

Stream Wikipedia articles:

```bash
# Check available configs (language editions)
# Use 20220301.en for the English Wikipedia snapshot

hf2vespa init wikipedia --config 20220301.simple --output wiki-config.yaml
hf2vespa feed wikipedia --config 20220301.simple --limit 100 --doctype article
```

Note: Full Wikipedia is large. Use `--limit` for testing.

---

### Example 5: Custom Embeddings Dataset

For datasets with pre-computed embeddings:

```yaml
# embedding-config.yaml
namespace: vectors
doctype: document
id_column: doc_id

mappings:
  - source: text
    target: content
    type: string
  - source: embedding
    target: vector
    type: tensor
```

```bash
hf2vespa feed your-embedding-dataset --config-file embedding-config.yaml
```

The `tensor` type converts Python lists to Vespa tensor format: `{"values": [0.1, 0.2, ...]}`

---

### Example 6: Hex-Encoded Embeddings (v2.0)

For memory-efficient embedding storage, use hex-encoded tensors:

```yaml
# hex-embedding-config.yaml
namespace: search
doctype: document
id_column: doc_id

mappings:
  - source: text
    target: content
    type: string
  # Full precision (8 hex chars per value)
  - source: embedding
    target: vector_f32
    type: tensor_float32_hex
  # Quantized (2 hex chars per value, 4x smaller)
  - source: quantized_embedding
    target: vector_int8
    type: tensor_int8_hex
```

```bash
hf2vespa feed your-embedding-dataset --config-file hex-embedding-config.yaml
```

---

### Example 7: ColBERT Multi-Vector Embeddings (v2.0)

For ColBERT-style token-level embeddings:

```yaml
# colbert-config.yaml
namespace: colbert
doctype: passage
id_column: passage_id

mappings:
  - source: text
    target: content
  # Token embeddings: {"token1": [0.1, 0.2, ...], "token2": [...]}
  - source: token_embeddings
    target: colbert_rep
    type: mixed_tensor_hex   # Uses float32 hex by default
```

The `mixed_tensor_hex` type supports `cell_type` options: `int8`, `bfloat16`, `float32` (default), `float64`.

---

### Example 8: Geo and Weighted Data (v2.0)

For location-aware search with term weights:

```yaml
# geo-weighted-config.yaml
namespace: places
doctype: venue

mappings:
  - source: name
    target: title
  # Geo coordinates for geo-search
  - source: coordinates
    target: location
    type: position   # {"lat": 37.4, "lng": -122.0}
  # Category weights for boosting
  - source: categories
    target: category_weights
    type: weightedset   # {"restaurant": 10, "cafe": 5}
```

---

### Piping to Vespa

Stream directly to a Vespa instance:

```bash
# Using vespa-cli
hf2vespa feed squad --limit 1000 | vespa feed -

# Or save and feed later
hf2vespa feed squad > feed.jsonl
vespa feed feed.jsonl
```

## Type Reference (v2.0)

Complete reference for all supported type converters.

### Basic Types

#### `string`
Converts any value to string.
```
Input: 123 → Output: "123"
```

#### `int`
Converts value to integer.
```
Input: "42" → Output: 42
```

#### `float`
Converts value to float.
```
Input: "3.14" → Output: 3.14
```

#### `tensor`
Converts list to Vespa indexed tensor (JSON array format).
```
Input: [0.1, 0.2, 0.3]
Output: {"values": [0.1, 0.2, 0.3]}
```

### Hex-Encoded Tensors

Memory-efficient tensor encoding for embeddings. Values are packed as binary and hex-encoded.

#### `tensor_int8_hex`
8-bit signed integers (-128 to 127). 2 hex chars per value.
```
Input: [11, 34, 3]
Output: {"values": "0b2203"}
```
**Use case:** Quantized embeddings, reduced storage (4x smaller than float32).

#### `tensor_bfloat16_hex`
Brain floating point (truncated float32). 4 hex chars per value.
```
Input: [1.0, -1.0, 0.0]
Output: {"values": "3f80bf800000"}
```
**Use case:** ML model weights, good range with reduced precision.
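
Since bfloat16 keeps float32's exponent and drops the low 16 mantissa bits, each value is just the top two bytes of its float32 encoding. A small illustrative sketch of that truncation (not the library's implementation):

```python
import struct

def bfloat16_hex(values):
    """Encode each value as bfloat16: the top 2 bytes of its big-endian float32."""
    return b"".join(struct.pack(">f", v)[:2] for v in values).hex()

print(bfloat16_hex([1.0, -1.0, 0.0]))   # -> 3f80bf800000
```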

#### `tensor_float32_hex`
IEEE 754 single precision. 8 hex chars per value.
```
Input: [3.14159]
Output: {"values": "40490fdb"}
```
**Use case:** Standard embedding precision.

#### `tensor_float64_hex`
IEEE 754 double precision. 16 hex chars per value.
```
Input: [3.141592653589793]
Output: {"values": "400921fb54442d18"}
```
**Use case:** High-precision scientific data.

### Scalar Types

#### `position`
Geo coordinates for location-based search.
```
Input: {"lat": 37.4, "lng": -122.0}
Output: {"lat": 37.4, "lng": -122.0}
```
**Validation:** Latitude must be -90 to 90, longitude -180 to 180.

#### `weightedset`
Key-weight pairs for weighted search.
```
Input: {"tag1": 10, "tag2": 5}
Output: {"tag1": 10, "tag2": 5}
```
**Note:** Keys are stringified, weights converted to integers.

#### `map`
Generic key-value maps.
```
Input: {1: "one", 2: "two"}
Output: {"1": "one", "2": "two"}
```
**Note:** Keys are stringified.
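
A compact sketch of how these three scalar converters could behave, including the validation and stringification rules stated above (illustrative only, not the actual hf2vespa code):

```python
def convert_position(value: dict) -> dict:
    lat, lng = float(value["lat"]), float(value["lng"])
    if not (-90 <= lat <= 90) or not (-180 <= lng <= 180):
        raise ValueError(f"invalid coordinates: {value}")
    return {"lat": lat, "lng": lng}

def convert_weightedset(value: dict) -> dict:
    return {str(k): int(w) for k, w in value.items()}   # keys stringified, weights as ints

def convert_map(value: dict) -> dict:
    return {str(k): v for k, v in value.items()}         # keys stringified, values untouched

print(convert_position({"lat": 37.4, "lng": -122.0}))
print(convert_weightedset({"tag1": 10, "tag2": 5}))
print(convert_map({1: "one", 2: "two"}))
```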

### Sparse and Mixed Tensors

#### `sparse_tensor`
Single mapped dimension using Vespa cells notation.
```
Input: {"word1": 0.8, "word2": 0.5}
Output: {"cells": [{"address": {"key": "word1"}, "value": 0.8},
                   {"address": {"key": "word2"}, "value": 0.5}]}
```
**Use case:** Term weights, feature importance scores.

#### `mixed_tensor`
Combined mapped + indexed dimensions using Vespa blocks notation.
```
Input: {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}
Output: {"blocks": {"w1": [1.0, 2.0], "w2": [3.0, 4.0]}}
```
**Use case:** ColBERT-style multi-vector embeddings.
**Validation:** All block arrays must have the same length.

#### `mixed_tensor_hex`
Mixed tensor with hex-encoded dense dimensions.
```
Input: {"w1": [11, 34, 3], "w2": [-124, 5, -1]} (with cell_type=int8)
Output: {"blocks": {"w1": "0b2203", "w2": "8405ff"}}
```
**Cell types:** `int8`, `bfloat16`, `float32` (default), `float64`
**Use case:** Memory-efficient ColBERT embeddings.

## Troubleshooting

### Authentication Errors

**Symptom:** `401 Unauthorized` or `403 Forbidden` when accessing private/gated datasets

**Cause:** HuggingFace authentication token not provided or invalid

**Solution:**

```bash
# Option 1: Environment variable
export HF_TOKEN=your_token_here
hf2vespa feed your-private-dataset

# Option 2: HuggingFace CLI login (persistent)
pip install huggingface_hub
huggingface-cli login
```

Get your token at: https://huggingface.co/settings/tokens

---

### Memory Issues with Large Datasets

**Symptom:** Process killed, MemoryError, or system becomes unresponsive

**Cause:** Dataset too large to fit in memory (this tool uses HF datasets streaming)

**Solution:**

```bash
# Use --limit to process in batches
hf2vespa feed large-dataset --limit 10000 > batch1.jsonl
hf2vespa feed large-dataset --limit 10000 --skip 10000 > batch2.jsonl

# Or pipe directly to Vespa (recommended)
hf2vespa feed large-dataset | vespa feed -
```

Note: The HuggingFace datasets library handles streaming efficiently. Memory issues are rare but can occur with very wide datasets (many columns) or large individual records.
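
The underlying mechanism is HuggingFace streaming mode, which iterates records lazily instead of materializing the full dataset. Roughly what that looks like with the `datasets` library (a sketch of the general approach, not hf2vespa's actual pipeline code):

```python
from datasets import load_dataset

# streaming=True yields an IterableDataset: records are fetched lazily,
# so memory use stays roughly constant regardless of dataset size.
ds = load_dataset("squad", split="train", streaming=True)

for i, record in enumerate(ds):
    if i >= 3:          # comparable to --limit 3
        break
    print(record["question"])
```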

---

### Type Conversion Errors

**Symptom:** `TypeError` or `ValueError` during feed generation

**Cause:** Column type doesn't match expected converter (e.g., `tensor` on a non-list field)

**Solution:**

1. Check your YAML config mappings
2. Verify the source column type with `init`:
   ```bash
   hf2vespa init your-dataset --config your-config
   ```
3. Match converter type to actual data:
   - Use `tensor` only for list/sequence columns (embeddings)
   - Use `string`, `int`, `float` for scalar values

---

### Multi-Config Dataset Errors

**Symptom:** "This dataset has multiple configurations" error

**Cause:** Dataset requires a `--config` argument (like `glue`, `super_glue`, etc.)

**Solution:**

```bash
# List available configs (check the HuggingFace dataset page)
# Then specify one:
hf2vespa feed glue --config ax
hf2vespa init glue --config cola
```

---

### Dataset Not Found

**Symptom:** `DatasetNotFoundError` or 404 error

**Cause:** Dataset name misspelled, private without auth, or doesn't exist

**Solution:**

1. Verify the dataset exists on the HuggingFace Hub
2. Check spelling (case-sensitive)
3. For private datasets, ensure `HF_TOKEN` is set (see Authentication Errors)

## Contributing

Issues and pull requests are welcome. Please open an issue to discuss major changes before submitting PRs.

## License

Apache-2.0 License - see repository for details.