dr-wandb 0.1.1__tar.gz → 0.1.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- dr_wandb-0.1.2/.python-version +1 -0
- dr_wandb-0.1.2/PKG-INFO +179 -0
- dr_wandb-0.1.2/README.md +165 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/pyproject.toml +18 -5
- dr_wandb-0.1.2/src/dr_wandb/cli/download.py +97 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/constants.py +3 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/fetch.py +4 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/uv.lock +383 -324
- dr_wandb-0.1.1/.python-version +0 -1
- dr_wandb-0.1.1/PKG-INFO +0 -124
- dr_wandb-0.1.1/README.md +0 -109
- dr_wandb-0.1.1/docs/processes/CODING_PRINCIPLES.md +0 -1
- dr_wandb-0.1.1/docs/processes/README.md +0 -1
- dr_wandb-0.1.1/docs/processes/audit_synthesis_pipeline.md +0 -1
- dr_wandb-0.1.1/docs/processes/design_philosophy.md +0 -1
- dr_wandb-0.1.1/docs/processes/documentation_organizer_guide.md +0 -1
- dr_wandb-0.1.1/docs/processes/fresh_eyes_review_guide.md +0 -1
- dr_wandb-0.1.1/docs/processes/general_project_extraction_prompt.md +0 -1
- dr_wandb-0.1.1/docs/processes/project_consolidation_methodology.md +0 -1
- dr_wandb-0.1.1/docs/processes/reporting_guide.md +0 -1
- dr_wandb-0.1.1/docs/processes/strategic_collaboration_guide.md +0 -1
- dr_wandb-0.1.1/docs/processes/tactical_execution_guide.md +0 -1
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/.claude/settings.local.json +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/.example.env +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/.gitignore +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/CLAUDE.md +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/LICENSE +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/__init__.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/cli/__init__.py +0 -0
- /dr_wandb-0.1.1/src/dr_wandb/cli/download.py → /dr_wandb-0.1.2/src/dr_wandb/cli/postgres_download.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/downloader.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/history_entry_record.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/py.typed +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/run_record.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/store.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/utils.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/conftest.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_cli_contract.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_cli_download.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_fetch.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_history_entry_record.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_query_builders.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_run_record.py +0 -0
- {dr_wandb-0.1.1 → dr_wandb-0.1.2}/tests/test_utils.py +0 -0
dr_wandb-0.1.2/.python-version ADDED
@@ -0,0 +1 @@
+3.13

dr_wandb-0.1.2/PKG-INFO ADDED
@@ -0,0 +1,179 @@
+Metadata-Version: 2.4
+Name: dr-wandb
+Version: 0.1.2
+Summary: Interact with wandb from python
+Author-email: Danielle Rothermel <danielle.rothermel@gmail.com>
+License-File: LICENSE
+Requires-Python: >=3.12
+Requires-Dist: pandas>=2.3.2
+Requires-Dist: pyarrow>=21.0.0
+Requires-Dist: sqlalchemy>=2.0.43
+Requires-Dist: typer>=0.20.0
+Requires-Dist: wandb>=0.21.4
+Description-Content-Type: text/markdown
+
+# dr_wandb
+
+A command-line utility for downloading and archiving Weights & Biases experiment data to local storage formats optimized for offline analysis.
+
+
+## Installation
+
+CLI Tool Install: `wandb-download`
+```
+uv tool install dr_wandb
+```
+
+Or, to use the library functions:
+```bash
+# To use the library functions
+uv add dr_wandb
+# Optionally
+uv add dr_wandb[postgres]
+uv sync
+```
+
+### Authentication
+
+Configure Weights & Biases authentication using one of these methods:
+
+```bash
+wandb login
+```
+
+Or set the API key as an environment variable:
+
+```bash
+export WANDB_API_KEY=your_api_key_here
+```
+
+## Quickstart
+
+The default approach doesn't involve Postgres: it fetches the runs, and optionally their histories, and dumps them to local `.pkl` files.
+
+```bash
+» wandb-download --help
+
+Usage: wandb-download [OPTIONS] ENTITY PROJECT OUTPUT_DIR
+
+╭─ Arguments ────────────────────────────────────────────────────────────────────────╮
+│ *    entity          TEXT  [required]                                              │
+│ *    project         TEXT  [required]                                              │
+│ *    output_dir      TEXT  [required]                                              │
+╰────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ──────────────────────────────────────────────────────────────────────────╮
+│ --runs-only             --no-runs-only      [default: no-runs-only]                │
+│ --runs-per-page                    INTEGER  [default: 500]                         │
+│ --log-every                        INTEGER  [default: 20]                          │
+│ --install-completion                        Install completion for the current shell. │
+│ --show-completion                           Show completion for the current shell, to copy it or customize the installation. │
+│ --help                                      Show this message and exit.            │
+╰────────────────────────────────────────────────────────────────────────────────────╯
+```
+
+An example:
+```bash
+» wandb-download --runs-only "ml-moe" "ft-scaling" "./data"
+2025-11-10 21:47:54 - INFO -
+:: Beginning Dr. Wandb Project Downloading Tool ::
+
+2025-11-10 21:47:54 - INFO - {
+    "entity": "ml-me",
+    "project": "scaling",
+    "output_dir": "data",
+    "runs_only": true,
+    "runs_per_page": 500,
+    "log_every": 20,
+    "runs_output_filename": "ml-me_scaling_runs.pkl",
+    "histories_output_filename": "ml-me_scaling_histories.pkl"
+}
+2025-11-10 21:47:54 - INFO -
+2025-11-10 21:47:54 - INFO - >> Downloading runs, this will take a while (minutes)
+wandb: Currently logged in as: danielle-rothermel (ml-moe) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
+2025-11-10 21:48:00 - INFO -  - total runs found: 517
+2025-11-10 21:48:00 - INFO - >> Serializing runs and maybe getting histories: False
+2025-11-10 21:48:07 - INFO - >> 20/517: 2025_08_21-08_24_43_test_finetune_DD-dolma1_7-10M_main_1Mtx1_--learning_rate=5e-05
+2025-11-10 21:48:12 - INFO - >> 40/517: 2025_08_21-08_24_43_test_finetune_DD-dolma1_7-150M_main_10Mtx1_--learning_rate=5e-06
+...
+2025-11-10 21:50:46 - INFO - >> Dumped runs data to: ./data/ml-moe_ft-scaling_runs.pkl
+2025-11-10 21:50:46 - INFO - >> Runs only, not dumping histories to: ./data/ml-moe_ft-scaling_histories.pkl
+```
+
+
+## Very Alpha: Postgres Version
+
+**It's very likely this won't currently work.** Download all runs from a Weights & Biases project:
+
+```bash
+uv run python src/dr_wandb/cli/postgres_download.py --entity your_entity --project your_project
+
+Options:
+  --entity TEXT        WandB entity (username or team name)
+  --project TEXT       WandB project name
+  --runs-only          Download only run metadata, skip training history
+  --force-refresh      Download all data, ignoring existing records
+  --db-url TEXT        PostgreSQL connection string
+  --output-dir TEXT    Directory for exported Parquet files
+  --help               Show help message and exit
+```
+
+The tool creates a PostgreSQL database, downloads experiment data, and exports Parquet files to the configured output directory. The tool tracks existing data and downloads only new or updated runs by default. A run is considered for update if:
+
+- It does not exist in the local database
+- Its state is "running" (indicating potential new data)
+
+Use `--force-refresh` to download all runs regardless of existing data.
+
+### Environment Variables
+
+The tool reads configuration from environment variables with the `DR_WANDB_` prefix and supports `.env` files:
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DR_WANDB_ENTITY` | Weights & Biases entity name | None |
+| `DR_WANDB_PROJECT` | Weights & Biases project name | None |
+| `DR_WANDB_DATABASE_URL` | PostgreSQL connection string | `postgresql+psycopg2://localhost/wandb` |
+| `DR_WANDB_OUTPUT_DIR` | Directory for exported files | `./data` |
+
+### Database Configuration
+
+The PostgreSQL connection string follows the standard format:
+
+```
+postgresql+psycopg2://username:password@host:port/database_name
+```
+
+If the specified database does not exist, the tool will attempt to create it automatically.
+
+### Data Schema
+
+The tool generates the following files in the output directory:
+
+- `runs_metadata.parquet` - Complete run metadata including configurations, summaries, and system information
+- `runs_history.parquet` - Training metrics and logged values over time
+- `runs_metadata_{component}.parquet` - Component-specific files for config, summary, wandb_metadata, system_metrics, system_attrs, and sweep_info
+
+**Run Records**
+- **run_id**: Unique identifier for the experiment run
+- **run_name**: Human-readable name assigned to the run
+- **state**: Current state (finished, running, crashed, failed, killed)
+- **project**: Project name
+- **entity**: Entity name
+- **created_at**: Timestamp of run creation
+- **config**: Experiment configuration parameters (JSONB)
+- **summary**: Final metrics and outputs (JSONB)
+- **wandb_metadata**: Platform-specific metadata (JSONB)
+- **system_metrics**: Hardware and system information (JSONB)
+- **system_attrs**: Additional system attributes (JSONB)
+- **sweep_info**: Hyperparameter sweep information (JSONB)
+
+**Training History Records**
+- **run_id**: Reference to the parent run
+- **step**: Training step number
+- **timestamp**: Time of metric logging
+- **runtime**: Elapsed time since run start
+- **wandb_metadata**: Platform logging metadata (JSONB)
+- **metrics**: All logged metrics and values (JSONB, flattened in Parquet export)

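The quickstart above dumps plain pickled Python objects: per the new `cli/download.py` in this release (shown further down), `runs` is a `list[dict]` with one entry per run and `histories` is a list of per-run lists of history rows. A minimal sketch of loading those archives for offline analysis; the pandas conversion and the flat-dict assumption are illustrative, not part of the package:

```python
import pickle

import pandas as pd

# Filenames follow the "{entity}_{project}_runs.pkl" pattern from the example output above.
runs_path = "./data/ml-moe_ft-scaling_runs.pkl"
histories_path = "./data/ml-moe_ft-scaling_histories.pkl"

with open(runs_path, "rb") as f:
    runs = pickle.load(f)  # list[dict]: one serialized run per entry

runs_df = pd.DataFrame(runs)  # assumes each run serializes to a mostly flat dict
print(runs_df.shape)

# Histories are only written when the download was not --runs-only.
with open(histories_path, "rb") as f:
    histories = pickle.load(f)  # list[list[dict]]: history rows grouped by run

history_df = pd.concat(
    (pd.DataFrame(rows) for rows in histories if rows),
    ignore_index=True,
)
print(history_df.head())
```
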
dr_wandb-0.1.2/README.md ADDED
@@ -0,0 +1,165 @@
+# dr_wandb
+
+A command-line utility for downloading and archiving Weights & Biases experiment data to local storage formats optimized for offline analysis.
+
+
+## Installation
+
+CLI Tool Install: `wandb-download`
+```
+uv tool install dr_wandb
+```
+
+Or, to use the library functions:
+```bash
+# To use the library functions
+uv add dr_wandb
+# Optionally
+uv add dr_wandb[postgres]
+uv sync
+```
+
+### Authentication
+
+Configure Weights & Biases authentication using one of these methods:
+
+```bash
+wandb login
+```
+
+Or set the API key as an environment variable:
+
+```bash
+export WANDB_API_KEY=your_api_key_here
+```
+
+## Quickstart
+
+The default approach doesn't involve Postgres: it fetches the runs, and optionally their histories, and dumps them to local `.pkl` files.
+
+```bash
+» wandb-download --help
+
+Usage: wandb-download [OPTIONS] ENTITY PROJECT OUTPUT_DIR
+
+╭─ Arguments ────────────────────────────────────────────────────────────────────────╮
+│ *    entity          TEXT  [required]                                              │
+│ *    project         TEXT  [required]                                              │
+│ *    output_dir      TEXT  [required]                                              │
+╰────────────────────────────────────────────────────────────────────────────────────╯
+╭─ Options ──────────────────────────────────────────────────────────────────────────╮
+│ --runs-only             --no-runs-only      [default: no-runs-only]                │
+│ --runs-per-page                    INTEGER  [default: 500]                         │
+│ --log-every                        INTEGER  [default: 20]                          │
+│ --install-completion                        Install completion for the current shell. │
+│ --show-completion                           Show completion for the current shell, to copy it or customize the installation. │
+│ --help                                      Show this message and exit.            │
+╰────────────────────────────────────────────────────────────────────────────────────╯
+```
+
+An example:
+```bash
+» wandb-download --runs-only "ml-moe" "ft-scaling" "./data"
+2025-11-10 21:47:54 - INFO -
+:: Beginning Dr. Wandb Project Downloading Tool ::
+
+2025-11-10 21:47:54 - INFO - {
+    "entity": "ml-me",
+    "project": "scaling",
+    "output_dir": "data",
+    "runs_only": true,
+    "runs_per_page": 500,
+    "log_every": 20,
+    "runs_output_filename": "ml-me_scaling_runs.pkl",
+    "histories_output_filename": "ml-me_scaling_histories.pkl"
+}
+2025-11-10 21:47:54 - INFO -
+2025-11-10 21:47:54 - INFO - >> Downloading runs, this will take a while (minutes)
+wandb: Currently logged in as: danielle-rothermel (ml-moe) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
+2025-11-10 21:48:00 - INFO -  - total runs found: 517
+2025-11-10 21:48:00 - INFO - >> Serializing runs and maybe getting histories: False
+2025-11-10 21:48:07 - INFO - >> 20/517: 2025_08_21-08_24_43_test_finetune_DD-dolma1_7-10M_main_1Mtx1_--learning_rate=5e-05
+2025-11-10 21:48:12 - INFO - >> 40/517: 2025_08_21-08_24_43_test_finetune_DD-dolma1_7-150M_main_10Mtx1_--learning_rate=5e-06
+...
+2025-11-10 21:50:46 - INFO - >> Dumped runs data to: ./data/ml-moe_ft-scaling_runs.pkl
+2025-11-10 21:50:46 - INFO - >> Runs only, not dumping histories to: ./data/ml-moe_ft-scaling_histories.pkl
+```
+
+
+## Very Alpha: Postgres Version
+
+**It's very likely this won't currently work.** Download all runs from a Weights & Biases project:
+
+```bash
+uv run python src/dr_wandb/cli/postgres_download.py --entity your_entity --project your_project
+
+Options:
+  --entity TEXT        WandB entity (username or team name)
+  --project TEXT       WandB project name
+  --runs-only          Download only run metadata, skip training history
+  --force-refresh      Download all data, ignoring existing records
+  --db-url TEXT        PostgreSQL connection string
+  --output-dir TEXT    Directory for exported Parquet files
+  --help               Show help message and exit
+```
+
+The tool creates a PostgreSQL database, downloads experiment data, and exports Parquet files to the configured output directory. The tool tracks existing data and downloads only new or updated runs by default. A run is considered for update if:
+
+- It does not exist in the local database
+- Its state is "running" (indicating potential new data)
+
+Use `--force-refresh` to download all runs regardless of existing data.
+
+### Environment Variables
+
+The tool reads configuration from environment variables with the `DR_WANDB_` prefix and supports `.env` files:
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DR_WANDB_ENTITY` | Weights & Biases entity name | None |
+| `DR_WANDB_PROJECT` | Weights & Biases project name | None |
+| `DR_WANDB_DATABASE_URL` | PostgreSQL connection string | `postgresql+psycopg2://localhost/wandb` |
+| `DR_WANDB_OUTPUT_DIR` | Directory for exported files | `./data` |
+
+### Database Configuration
+
+The PostgreSQL connection string follows the standard format:
+
+```
+postgresql+psycopg2://username:password@host:port/database_name
+```
+
+If the specified database does not exist, the tool will attempt to create it automatically.
+
+### Data Schema
+
+The tool generates the following files in the output directory:
+
+- `runs_metadata.parquet` - Complete run metadata including configurations, summaries, and system information
+- `runs_history.parquet` - Training metrics and logged values over time
+- `runs_metadata_{component}.parquet` - Component-specific files for config, summary, wandb_metadata, system_metrics, system_attrs, and sweep_info
+
+**Run Records**
+- **run_id**: Unique identifier for the experiment run
+- **run_name**: Human-readable name assigned to the run
+- **state**: Current state (finished, running, crashed, failed, killed)
+- **project**: Project name
+- **entity**: Entity name
+- **created_at**: Timestamp of run creation
+- **config**: Experiment configuration parameters (JSONB)
+- **summary**: Final metrics and outputs (JSONB)
+- **wandb_metadata**: Platform-specific metadata (JSONB)
+- **system_metrics**: Hardware and system information (JSONB)
+- **system_attrs**: Additional system attributes (JSONB)
+- **sweep_info**: Hyperparameter sweep information (JSONB)
+
+**Training History Records**
+- **run_id**: Reference to the parent run
+- **step**: Training step number
+- **timestamp**: Time of metric logging
+- **runtime**: Elapsed time since run start
+- **wandb_metadata**: Platform logging metadata (JSONB)
+- **metrics**: All logged metrics and values (JSONB, flattened in Parquet export)

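The Data Schema section above describes the Parquet files that the Postgres workflow exports. A short sketch of reading them back with pandas (pandas and pyarrow are already dependencies); the file names and the `run_id`/`run_name`/`state` columns come from the schema above, while the join itself is only an illustration:

```python
from pathlib import Path

import pandas as pd

out = Path("./data")  # matches the DR_WANDB_OUTPUT_DIR default

metadata = pd.read_parquet(out / "runs_metadata.parquet")  # one row per run
history = pd.read_parquet(out / "runs_history.parquet")    # one row per logged step

# Component-specific slices, e.g. just the run configs:
configs = pd.read_parquet(out / "runs_metadata_config.parquet")

# Attach run-level labels to the step-level history via run_id.
merged = history.merge(
    metadata[["run_id", "run_name", "state"]],
    on="run_id",
    how="left",
)
print(merged.head())
```
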
{dr_wandb-0.1.1 → dr_wandb-0.1.2}/pyproject.toml CHANGED
@@ -1,6 +1,6 @@
 [project]
 name = "dr-wandb"
-version = "0.1.1"
+version = "0.1.2"
 description = "Interact with wandb from python"
 readme = "README.md"
 authors = [
@@ -9,15 +9,14 @@ authors = [
 requires-python = ">=3.12"
 dependencies = [
     "pandas>=2.3.2",
-    "psycopg2>=2.9.10",
     "pyarrow>=21.0.0",
-    "
-    "sqlalchemy>=2.0.43",
+    "typer>=0.20.0",
     "wandb>=0.21.4",
+    "sqlalchemy>=2.0.43",
 ]
 
 [project.scripts]
-wandb-download = "dr_wandb.cli.download:
+wandb-download = "dr_wandb.cli.download:app"
 
 [build-system]
 requires = ["hatchling"]
@@ -26,14 +25,25 @@ build-backend = "hatchling.build"
 [tool.hatch.metadata]
 allow-direct-references = true
 
+[tool.hatch.build]
+exclude = [
+    "docs",
+    "docs/**",
+]
+
+
 [tool.hatch.build.targets.wheel]
 packages = ["src/dr_wandb"]
 
 
 [dependency-groups]
 dev = [
+    "dr-wandb",
     "pytest>=8.4.1",
 ]
+postgres = [
+    "pydantic-settings>=2.10.1",
+]
 
 [tool.ruff]
 include = [
@@ -154,3 +164,6 @@ seaborn = "sns"
 polars = "pl"
 lightning = "L"
 "jax.numpy" = "jnp"
+
+[tool.uv.sources]
+dr-wandb = { workspace = true }

dr_wandb-0.1.2/src/dr_wandb/cli/download.py ADDED
@@ -0,0 +1,97 @@
+from typing import Any
+import logging
+from pydantic import BaseModel, Field, computed_field
+from pathlib import Path
+import typer
+import pickle
+
+from dr_wandb.fetch import fetch_project_runs
+
+app = typer.Typer()
+
+class ProjDownloadConfig(BaseModel):
+    entity: str
+    project: str
+    output_dir: Path = Field(
+        default_factory=lambda: (
+            Path(__file__).parent.parent.parent.parent / "data"
+        )
+    )
+    runs_only: bool = False
+    runs_per_page: int = 500
+    log_every: int = 20
+
+    runs_output_filename: str = Field(
+        default_factory=lambda data: (
+            f"{data['entity']}_{data['project']}_runs.pkl"
+        )
+    )
+    histories_output_filename: str = Field(
+        default_factory=lambda data: (
+            f"{data['entity']}_{data['project']}_histories.pkl"
+        )
+    )
+
+    def progress_callback(self, run_index: int, total_runs: int, message: str)-> None:
+        if run_index % self.log_every == 0:
+            logging.info(f">> {run_index}/{total_runs}: {message}")
+
+
+    @computed_field
+    @property
+    def fetch_runs_cfg(self) -> dict[str, Any]:
+        return {
+            "entity": self.entity,
+            "project": self.project,
+            "runs_per_page": self.runs_per_page,
+            "progress_callback": self.progress_callback,
+            "include_history": not self.runs_only,
+        }
+
+def setup_logging(level: str = "INFO") -> None:
+    logging.basicConfig(
+        level=getattr(logging, level.upper()),
+        format="%(asctime)s - %(levelname)s - %(message)s",
+        datefmt="%Y-%m-%d %H:%M:%S",
+    )
+
+
+@app.command()
+def download_project(
+    entity: str,
+    project: str,
+    output_dir: str,
+    runs_only: bool = False,
+    runs_per_page: int = 500,
+    log_every: int = 20,
+) -> None:
+    setup_logging()
+    logging.info("\n:: Beginning Dr. Wandb Project Downloading Tool ::\n")
+
+    cfg = ProjDownloadConfig(
+        entity=entity,
+        project=project,
+        output_dir=output_dir,
+        runs_only=runs_only,
+        runs_per_page=runs_per_page,
+        log_every=log_every,
+    )
+    logging.info(str(cfg.model_dump_json(indent=4, exclude="fetch_runs_cfg")))
+    logging.info("")
+
+    runs, histories = fetch_project_runs(**cfg.fetch_runs_cfg)
+    runs_filename = f"{output_dir}/{cfg.runs_output_filename}"
+    histories_filename = f"{output_dir}/{cfg.histories_output_filename}"
+    with open(runs_filename, 'wb') as run_file:
+        pickle.dump(runs, run_file)
+    logging.info(f">> Dumped runs data to: {runs_filename}")
+    if not cfg.runs_only:
+        with open(histories_filename, 'wb') as hist_file:
+            pickle.dump(histories, hist_file)
+        logging.info(f">> Dumped histories data to: {histories_filename}")
+    else:
+        logging.info(f">> Runs only, not dumping histories to: {histories_filename}")
+
+
+if __name__ == "__main__":
+    app()

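The `wandb-download` command above is a thin Typer wrapper around `dr_wandb.fetch.fetch_project_runs`. A rough sketch of calling it directly from Python, using only the keyword arguments and return shape visible in `download_project` above; the entity/project values and the callback body are placeholders:

```python
import pickle

from dr_wandb.fetch import fetch_project_runs


def log_progress(run_index: int, total_runs: int, message: str) -> None:
    # Same signature as ProjDownloadConfig.progress_callback above.
    if run_index % 20 == 0:
        print(f">> {run_index}/{total_runs}: {message}")


runs, histories = fetch_project_runs(
    entity="your_entity",      # placeholder
    project="your_project",    # placeholder
    runs_per_page=500,
    progress_callback=log_progress,
    include_history=True,
)

with open("your_entity_your_project_runs.pkl", "wb") as f:
    pickle.dump(runs, f)
```
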
{dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/constants.py CHANGED
@@ -1,6 +1,7 @@
 from collections.abc import Callable
 from typing import Literal
 
+from sqlalchemy import String
 from sqlalchemy.orm import DeclarativeBase
 
 
@@ -17,4 +18,6 @@ WANDB_RUN_STATES = ["finished", "running", "crashed", "failed", "killed"]
 type RunState = Literal["finished", "running", "crashed", "failed", "killed"]
 type RunId = str
 
+Base.type_annotation_map = {RunId: String}
+
 type ProgressCallback = Callable[[int, int, str], None]

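For context on the change above: in SQLAlchemy 2.0, a `DeclarativeBase`'s `type_annotation_map` controls which SQL type a Python annotation resolves to, so registering `RunId: String` lets ORM models annotate columns as `Mapped[RunId]` and get a string column. A small self-contained sketch of that mechanism, assuming a recent SQLAlchemy 2.0 release that resolves PEP 695 `type` aliases through the map (the `RunsTable` model is illustrative, not from the package):

```python
from sqlalchemy import String, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

type RunId = str  # mirrors the alias in constants.py


class Base(DeclarativeBase):
    # Declared up front here; constants.py assigns the map onto its existing Base.
    type_annotation_map = {RunId: String}


class RunsTable(Base):  # illustrative model, not part of dr_wandb
    __tablename__ = "runs"

    run_id: Mapped[RunId] = mapped_column(primary_key=True)  # resolves to VARCHAR
    run_name: Mapped[str] = mapped_column(String(256))


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)  # emits CREATE TABLE runs (run_id VARCHAR NOT NULL, ...)
```
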
{dr_wandb-0.1.1 → dr_wandb-0.1.2}/src/dr_wandb/fetch.py CHANGED
@@ -6,6 +6,7 @@ from collections.abc import Callable, Iterator
 from typing import Any
 
 import wandb
+import logging
 
 from dr_wandb.history_entry_record import HistoryEntryRecord
 from dr_wandb.run_record import RunRecord
@@ -62,9 +63,12 @@ def fetch_project_runs(
     runs: list[dict[str, Any]] = []
     histories: list[list[dict[str, Any]]] = []
 
+    logging.info(">> Downloading runs, this will take a while (minutes)")
     run_iter = list(_iterate_runs(entity, project, runs_per_page=runs_per_page))
     total = len(run_iter)
+    logging.info(f" - total runs found: {total}")
 
+    logging.info(f">> Serializing runs and maybe getting histories: {include_history}")
     for index, run in enumerate(run_iter, start=1):
         runs.append(serialize_run(run))
         if include_history: