faceberg 0.1.0__tar.gz → 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- faceberg-0.1.2/PKG-INFO +149 -0
- faceberg-0.1.2/README.md +106 -0
- faceberg-0.1.2/faceberg/_version.py +34 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/catalog.py +92 -76
- faceberg-0.1.2/faceberg/discover.py +181 -0
- faceberg-0.1.2/faceberg/iceberg.py +707 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_catalog.py +1 -2
- faceberg-0.1.2/faceberg/tests/test_discover.py +257 -0
- faceberg-0.1.2/faceberg/tests/test_iceberg.py +911 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/pyproject.toml +9 -3
- faceberg-0.1.0/PKG-INFO +0 -175
- faceberg-0.1.0/README.md +0 -132
- faceberg-0.1.0/faceberg/bridge.py +0 -586
- faceberg-0.1.0/faceberg/convert.py +0 -813
- faceberg-0.1.0/faceberg/tests/test_bridge.py +0 -825
- faceberg-0.1.0/faceberg/tests/test_convert.py +0 -422
- {faceberg-0.1.0 → faceberg-0.1.2}/.gitignore +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/LICENSE +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/__init__.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/cli.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/config.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/pretty.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/server.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/shell.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/spaces/Dockerfile +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/spaces/README.md +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/spaces/landing.html +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/__init__.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/conftest.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_catalog_duckdb.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_catalog_pandas.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_cli.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_config.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_pretty.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_server.py +0 -0
- {faceberg-0.1.0 → faceberg-0.1.2}/faceberg/tests/test_server_playwright.py +0 -0
faceberg-0.1.2/PKG-INFO
ADDED
@@ -0,0 +1,149 @@
Metadata-Version: 2.4
Name: faceberg
Version: 0.1.2
Summary: Bridge HuggingFace datasets with Apache Iceberg
Project-URL: Homepage, https://github.com/kszucs/faceberg
Project-URL: Documentation, https://github.com/kszucs/faceberg
Project-URL: Repository, https://github.com/kszucs/faceberg
Author-email: Krisztian Szucs <kszucs@users.noreply.github.com>
License: Apache-2.0
License-File: LICENSE
Keywords: data-lake,datasets,huggingface,iceberg
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: click>=8.0.0
Requires-Dist: datasets>=2.0.0
Requires-Dist: fsspec>=2023.1.0
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: jinja2>=3.1.6
Requires-Dist: litestar>=2.0.0
Requires-Dist: pyarrow>=21.0.0
Requires-Dist: pyiceberg>=0.10.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: uuid-utils>=0.9.0
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: duckdb>=0.10.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-playwright>=0.7.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: requests>=2.31.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown


# Faceberg

**Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.**

Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.

## Installation

```bash
pip install faceberg
```

## Quick Start

```bash
export HF_TOKEN=your_huggingface_token

# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init

# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main

# Query with interactive DuckDB shell
faceberg user/mycatalog quack
```

```sql
SELECT label, substr(text, 1, 100) as preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;
```

## How It Works

```
                      HuggingFace Hub
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  ┌─────────────────────┐    ┌─────────────────────────┐  │
│  │     HF Datasets     │    │   HF Spaces (Catalog)   │  │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │  │
│  │                     │    │  • REST API endpoint    │  │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │  │
│  │  └── *.parquet      │    │                         │  │
│  └─────────────────────┘    └───────────┬─────────────┘  │
│                                         │                │
└─────────────────────────────────────────┼────────────────┘
                                          │ Iceberg REST API
                                          ▼
                             ┌─────────────────────────┐
                             │      Query Engines      │
                             │  DuckDB, Pandas, Spark  │
                             └─────────────────────────┘
```

**No data is copied** — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.

## Python API

```python
import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
```

## Share Your Catalog

Your catalog is accessible to anyone via the REST API:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")

result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
```

## Documentation

**[Read the docs →](https://faceberg.kszucs.dev/)**

- [Getting Started](https://faceberg.kszucs.dev/) — Full quickstart guide
- [Local Catalogs](https://faceberg.kszucs.dev/local.html) — Use local catalogs for development
- [DuckDB Integration](https://faceberg.kszucs.dev/integrations/duckdb.html) — Advanced SQL queries
- [Pandas Integration](https://faceberg.kszucs.dev/integrations/pandas.html) — Load into DataFrames

## Development

```bash
git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .
```

## License

Apache 2.0
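The long description above names PyIceberg among the supported clients but only demonstrates DuckDB. For reference, a minimal PyIceberg sketch of the same REST access pattern — the catalog name, Space URL, and table identifier are the illustrative ones from the README, not values shipped in the package:

```python
from pyiceberg.catalog import load_catalog

# Connect to the catalog's Iceberg REST endpoint (URL is illustrative).
catalog = load_catalog(
    "mycatalog",
    **{
        "type": "rest",
        "uri": "https://user-mycatalog.hf.space",
    },
)

# Load a table registered by `faceberg ... add` and read a few rows.
table = catalog.load_table("stanfordnlp.imdb")
print(table.scan(limit=5).to_arrow())
```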
faceberg-0.1.2/README.md
ADDED
@@ -0,0 +1,106 @@

# Faceberg

**Bridge HuggingFace datasets with Apache Iceberg tables — no data copying, just metadata.**

Faceberg maps HuggingFace datasets to Apache Iceberg tables. Your catalog metadata lives on HuggingFace Spaces with an auto-deployed REST API, and any Iceberg-compatible query engine can access the data.

## Installation

```bash
pip install faceberg
```

## Quick Start

```bash
export HF_TOKEN=your_huggingface_token

# Create a catalog on HuggingFace Hub
faceberg user/mycatalog init

# Add datasets
faceberg user/mycatalog add stanfordnlp/imdb --config plain_text
faceberg user/mycatalog add openai/gsm8k --config main

# Query with interactive DuckDB shell
faceberg user/mycatalog quack
```

```sql
SELECT label, substr(text, 1, 100) as preview
FROM iceberg_catalog.stanfordnlp.imdb
LIMIT 10;
```

## How It Works

```
                      HuggingFace Hub
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  ┌─────────────────────┐    ┌─────────────────────────┐  │
│  │     HF Datasets     │    │   HF Spaces (Catalog)   │  │
│  │  (Original Parquet) │◄───│  • Iceberg metadata     │  │
│  │                     │    │  • REST API endpoint    │  │
│  │  stanfordnlp/imdb/  │    │  • faceberg.yml         │  │
│  │  └── *.parquet      │    │                         │  │
│  └─────────────────────┘    └───────────┬─────────────┘  │
│                                         │                │
└─────────────────────────────────────────┼────────────────┘
                                          │ Iceberg REST API
                                          ▼
                             ┌─────────────────────────┐
                             │      Query Engines      │
                             │  DuckDB, Pandas, Spark  │
                             └─────────────────────────┘
```

**No data is copied** — only metadata is created. Query with DuckDB, PyIceberg, Spark, or any Iceberg-compatible tool.

## Python API

```python
import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")
df = table.scan(limit=100).to_pandas()
```

## Share Your Catalog

Your catalog is accessible to anyone via the REST API:

```python
import duckdb

conn = duckdb.connect()
conn.execute("INSTALL iceberg; LOAD iceberg")
conn.execute("ATTACH 'https://user-mycatalog.hf.space' AS cat (TYPE ICEBERG)")

result = conn.execute("SELECT * FROM cat.stanfordnlp.imdb LIMIT 5").fetchdf()
```

## Documentation

**[Read the docs →](https://faceberg.kszucs.dev/)**

- [Getting Started](https://faceberg.kszucs.dev/) — Full quickstart guide
- [Local Catalogs](https://faceberg.kszucs.dev/local.html) — Use local catalogs for development
- [DuckDB Integration](https://faceberg.kszucs.dev/integrations/duckdb.html) — Advanced SQL queries
- [Pandas Integration](https://faceberg.kszucs.dev/integrations/pandas.html) — Load into DataFrames

## Development

```bash
git clone https://github.com/kszucs/faceberg
cd faceberg
pip install -e .
```

## License

Apache 2.0
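The README's Python API example stops at a plain limited scan. Assuming the object returned by `cat.load_table()` behaves like a standard PyIceberg table (which the REST examples suggest), filtered and projected scans follow the usual PyIceberg scan API; the `label` and `text` column names below come from the imdb example above and this snippet is only a sketch:

```python
import os
from faceberg import catalog

cat = catalog("user/mycatalog", hf_token=os.environ.get("HF_TOKEN"))
table = cat.load_table("stanfordnlp.imdb")

# Push a filter and a projection into the scan instead of materializing everything.
df = table.scan(
    row_filter="label = 1",
    selected_fields=("text", "label"),
    limit=100,
).to_pandas()
```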
faceberg-0.1.2/faceberg/_version.py
ADDED
@@ -0,0 +1,34 @@
# file generated by setuptools-scm
# don't change, don't track in version control

__all__ = [
    "__version__",
    "__version_tuple__",
    "version",
    "version_tuple",
    "__commit_id__",
    "commit_id",
]

TYPE_CHECKING = False
if TYPE_CHECKING:
    from typing import Tuple
    from typing import Union

    VERSION_TUPLE = Tuple[Union[int, str], ...]
    COMMIT_ID = Union[str, None]
else:
    VERSION_TUPLE = object
    COMMIT_ID = object

version: str
__version__: str
__version_tuple__: VERSION_TUPLE
version_tuple: VERSION_TUPLE
commit_id: COMMIT_ID
__commit_id__: COMMIT_ID

__version__ = version = '0.1.2'
__version_tuple__ = version_tuple = (0, 1, 2)

__commit_id__ = commit_id = None
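This is the standard module setuptools-scm writes at build time. Whether `faceberg/__init__.py` re-exports these attributes is not shown in this diff, so the sketch below imports the generated module directly:

```python
# Read the build-time version metadata written by setuptools-scm.
from faceberg._version import __version__, version_tuple

print(__version__)    # '0.1.2'
print(version_tuple)  # (0, 1, 2)
```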
{faceberg-0.1.0 → faceberg-0.1.2}/faceberg/catalog.py
@@ -4,7 +4,6 @@ import logging
 import os
 import shutil
 import tempfile
-import uuid
 from contextlib import contextmanager
 from pathlib import Path
 from typing import TYPE_CHECKING, Any, Callable, List, Optional, Set, Union
@@ -20,7 +19,7 @@ from pyiceberg.exceptions import (
     NoSuchTableError,
     TableAlreadyExistsError,
 )
-from pyiceberg.io import FileIO
+from pyiceberg.io import FileIO, load_file_io
 from pyiceberg.io.fsspec import FsspecFileIO
 from pyiceberg.partitioning import UNPARTITIONED_PARTITION_SPEC, PartitionKey, PartitionSpec
 from pyiceberg.schema import Schema
@@ -34,8 +33,8 @@ from pyiceberg.typedef import EMPTY_DICT, Properties
 from uuid_utils import uuid7

 from . import config as cfg
-from .
-from .
+from .discover import discover_dataset
+from .iceberg import write_snapshot

 if TYPE_CHECKING:
     import pyarrow as pa
@@ -361,8 +360,6 @@ class BaseCatalog(Catalog):
         Returns:
             FileIO instance with authentication configured
         """
-        from pyiceberg.io import load_file_io
-
         # Start with catalog's persisted properties
         props = dict(self.properties)
         # Add runtime-only token if available
@@ -956,72 +953,82 @@ class BaseCatalog(Catalog):
                 identifier, state="in_progress", percent=0, stage="Discovering dataset"
             )

-        dataset_info =
+        dataset_info = discover_dataset(
             repo_id=repo,
             config=config,
             token=self._hf_token,
         )

-        #
+        # Prepare schema with split column
         if progress_callback:
-            progress_callback(
+            progress_callback(
+                identifier, state="in_progress", percent=10, stage="Converting schema"
+            )
+
+        if not dataset_info.files:
+            raise ValueError(f"No Parquet files found in dataset {repo}")

-        #
-
-
-
-
+        # Convert HuggingFace features to Arrow schema
+        arrow_schema = dataset_info.features.arrow_schema
+
+        # Build table properties
+        data_path = (
+            f"hf://datasets/{repo}/{dataset_info.data_dir}"
+            if dataset_info.data_dir
+            else f"hf://datasets/{repo}"
         )

-
+        properties = {
+            "format-version": "2",
+            "write.parquet.compression-codec": "snappy",
+            "write.py-location-provider.impl": "faceberg.catalog.HfLocationProvider",
+            "write.data.path": data_path,
+            "hf.dataset.repo": repo,
+            "hf.dataset.config": config,
+            "hf.dataset.revision": dataset_info.revision,
+            "hf.write.pattern": "{split}-{uuid}-iceberg.parquet",
+            "hf.write.split": "train",
+        }
+
+        # Write Iceberg metadata
         if progress_callback:
             progress_callback(
-                identifier, state="in_progress", percent=
+                identifier, state="in_progress", percent=20, stage="Writing Iceberg metadata"
             )

         with self._staging() as staging:
-            # Define table directory in the staging area
-            # Note: IcebergMetadataWriter will create the metadata subdirectory
-            table_dir = staging / identifier.path
-            table_dir.mkdir(parents=True, exist_ok=True)
-
             # Create table URI for metadata
             table_uri = self.uri / identifier.path

-            #
-
-                table_path=table_dir,
-                schema=table_info.schema,
-                partition_spec=table_info.partition_spec,
-                base_uri=table_uri,
-            )
+            # Load FileIO with HuggingFace support
+            io = self._load_file_io(location=str(table_uri))

-            #
-
-
-
-
-
-
-                properties=
-
-
+            # Write snapshot metadata with split column
+            write_snapshot(
+                files=dataset_info.files,
+                schema=arrow_schema,
+                current_metadata=None,
+                output_dir=staging / identifier.path,
+                base_uri=str(table_uri),
+                properties=properties,
+                include_split_column=True,
+                io=io,
             )

-            #
-            # Record all created files in the table directory
+            # Record all created files in the table metadata directory
             if progress_callback:
                 progress_callback(identifier, state="in_progress", percent=90, stage="Finalizing")

-
+            metadata_dir = staging / identifier.path / "metadata"
+            for path in metadata_dir.rglob("*"):
                 if path.is_file():
                     staging.add(path.relative_to(staging.path))

             # Register table in config if not already there
             if identifier not in catalog_config:
                 catalog_config[identifier] = cfg.Dataset(
-                    repo=
-                    config=
+                    repo=repo,
+                    config=config,
                 )
                 # Save config since we added a dataset table
                 catalog_config.to_yaml(staging / "faceberg.yml")
@@ -1109,16 +1116,17 @@ class BaseCatalog(Catalog):
             "Please recreate the table to enable incremental sync."
         )

-        # Discover dataset at current revision
-
+        # Discover dataset at current revision
+        # Note: The new discover_dataset() doesn't support since_revision filtering yet
+        # So we discover all files and write_snapshot() will handle the diff
+        dataset_info = discover_dataset(
            repo_id=table_entry.repo,
            config=table_entry.config,
            token=self._hf_token,
-           since_revision=old_revision,
        )

-        # Check if already up to date (
-        if
+        # Check if already up to date (same revision)
+        if dataset_info.revision == old_revision:
             logger.info(f"Table {identifier} already at revision {old_revision}")
             if progress_callback:
                 progress_callback(
@@ -1126,43 +1134,51 @@ class BaseCatalog(Catalog):
                 )
             return table

-        #
-        #
-
-
-
+        # Use existing table schema - don't modify it
+        # The schema was already set correctly when the table was created
+
+        # Build updated properties
+        data_path = (
+            f"hf://datasets/{table_entry.repo}/{dataset_info.data_dir}"
+            if dataset_info.data_dir
+            else f"hf://datasets/{table_entry.repo}"
        )

-
-
-
-
+        properties = {
+            "format-version": "2",
+            "write.parquet.compression-codec": "snappy",
+            "write.py-location-provider.impl": "faceberg.catalog.HfLocationProvider",
+            "write.data.path": data_path,
+            "hf.dataset.repo": table_entry.repo,
+            "hf.dataset.config": table_entry.config,
+            "hf.dataset.revision": dataset_info.revision,
+            "hf.write.pattern": "{split}-{uuid}-iceberg.parquet",
+            "hf.write.split": "train",
+        }

-        # Append new snapshot with
+        # Append new snapshot with all files (write_snapshot will handle diffing)
         with self._staging() as staging:
-            # Create local metadata directory
-            metadata_dir = staging / identifier.path / "metadata"
-            metadata_dir.mkdir(parents=True, exist_ok=True)
-
             # Create table URI for metadata
-            table_uri = self.uri / identifier.path
-
-            # Create metadata writer
-            metadata_writer = IcebergMetadataWriter(
-                table_path=metadata_dir,
-                schema=table_info.schema,
-                partition_spec=table_info.partition_spec,
-                base_uri=table_uri,
-            )
+            table_uri = self.uri / identifier.path

-            #
-
-
+            # Load FileIO with HuggingFace support
+            io = self._load_file_io(location=str(table_uri))
+
+            # Write new snapshot (will diff against current_metadata)
+            # Schema and include_split_column parameters are ignored when current_metadata exists
+            # - it uses current_metadata.schema() and current_metadata.spec()
+            write_snapshot(
+                files=dataset_info.files,
+                schema=dataset_info.features.arrow_schema,  # Only used if creating new table
                 current_metadata=table.metadata,
-
+                output_dir=staging / identifier.path,
+                base_uri=str(table_uri),
+                properties=properties,
+                io=io,
             )

-            # Record all files in the
+            # Record all files in the metadata directory (including new manifest/metadata files)
+            metadata_dir = staging / identifier.path / "metadata"
             for path in metadata_dir.rglob("*"):
                 if path.is_file():
                     staging.add(path.relative_to(staging.path))