strands-transformers 0.2.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- strands_transformers-0.2.0/.github/workflows/docs.yml +59 -0
- strands_transformers-0.2.0/.github/workflows/release.yml +80 -0
- strands_transformers-0.2.0/.gitignore +13 -0
- strands_transformers-0.2.0/ARCHITECTURE.md +93 -0
- strands_transformers-0.2.0/PKG-INFO +252 -0
- strands_transformers-0.2.0/README.md +203 -0
- strands_transformers-0.2.0/agent.py +68 -0
- strands_transformers-0.2.0/docs/assets/audio/omni_speak.wav +0 -0
- strands_transformers-0.2.0/docs/assets/audio/tts_hello.wav +0 -0
- strands_transformers-0.2.0/docs/assets/extra.css +27 -0
- strands_transformers-0.2.0/docs/assets/img/blue.png +0 -0
- strands_transformers-0.2.0/docs/assets/img/green.png +0 -0
- strands_transformers-0.2.0/docs/assets/logo.svg +22 -0
- strands_transformers-0.2.0/docs/guide/agent-brain.md +94 -0
- strands_transformers-0.2.0/docs/guide/agentic-robot.md +79 -0
- strands_transformers-0.2.0/docs/guide/audio.md +103 -0
- strands_transformers-0.2.0/docs/guide/compat.md +23 -0
- strands_transformers-0.2.0/docs/guide/content-blocks.md +91 -0
- strands_transformers-0.2.0/docs/guide/contributing.md +34 -0
- strands_transformers-0.2.0/docs/guide/installation.md +70 -0
- strands_transformers-0.2.0/docs/guide/quickstart.md +50 -0
- strands_transformers-0.2.0/docs/guide/robotics.md +106 -0
- strands_transformers-0.2.0/docs/guide/the-tool.md +94 -0
- strands_transformers-0.2.0/docs/index.md +94 -0
- strands_transformers-0.2.0/docs/reference/architecture.md +51 -0
- strands_transformers-0.2.0/docs/reference/examples.md +56 -0
- strands_transformers-0.2.0/docs/reference/transformer-model.md +20 -0
- strands_transformers-0.2.0/docs/reference/use-transformers.md +8 -0
- strands_transformers-0.2.0/examples/README.md +137 -0
- strands_transformers-0.2.0/examples/audio_content_block.py +78 -0
- strands_transformers-0.2.0/examples/cosmos_reason_embodied.py +77 -0
- strands_transformers-0.2.0/examples/document_and_audio.py +90 -0
- strands_transformers-0.2.0/examples/local_model_agent.py +41 -0
- strands_transformers-0.2.0/examples/molmoact_vla.py +83 -0
- strands_transformers-0.2.0/examples/multimodal_advanced.py +102 -0
- strands_transformers-0.2.0/examples/multimodal_agent.py +64 -0
- strands_transformers-0.2.0/examples/multimodal_pipelines.py +153 -0
- strands_transformers-0.2.0/examples/omni_audio.py +104 -0
- strands_transformers-0.2.0/examples/openvla_vla.py +98 -0
- strands_transformers-0.2.0/examples/robot_reason_act_agent.py +169 -0
- strands_transformers-0.2.0/examples/smoke.py +122 -0
- strands_transformers-0.2.0/examples/smolvlm_image_text.py +48 -0
- strands_transformers-0.2.0/examples/vision_tasks.py +85 -0
- strands_transformers-0.2.0/mkdocs.yml +96 -0
- strands_transformers-0.2.0/pyproject.toml +68 -0
- strands_transformers-0.2.0/requirements.txt +5 -0
- strands_transformers-0.2.0/setup.cfg +4 -0
- strands_transformers-0.2.0/setup.py +5 -0
- strands_transformers-0.2.0/strands_transformers/__init__.py +52 -0
- strands_transformers-0.2.0/strands_transformers/_version.py +24 -0
- strands_transformers-0.2.0/strands_transformers/core/__init__.py +5 -0
- strands_transformers-0.2.0/strands_transformers/core/compat.py +251 -0
- strands_transformers-0.2.0/strands_transformers/core/engine.py +160 -0
- strands_transformers-0.2.0/strands_transformers/core/io.py +273 -0
- strands_transformers-0.2.0/strands_transformers/core/registry.py +195 -0
- strands_transformers-0.2.0/strands_transformers/models/__init__.py +5 -0
- strands_transformers-0.2.0/strands_transformers/models/transformers.py +1421 -0
- strands_transformers-0.2.0/strands_transformers/tools/__init__.py +5 -0
- strands_transformers-0.2.0/strands_transformers/tools/use_transformers.py +409 -0
- strands_transformers-0.2.0/strands_transformers/types/__init__.py +24 -0
- strands_transformers-0.2.0/strands_transformers/types/audio.py +91 -0
- strands_transformers-0.2.0/strands_transformers.egg-info/PKG-INFO +252 -0
- strands_transformers-0.2.0/strands_transformers.egg-info/SOURCES.txt +64 -0
- strands_transformers-0.2.0/strands_transformers.egg-info/dependency_links.txt +1 -0
- strands_transformers-0.2.0/strands_transformers.egg-info/requires.txt +33 -0
- strands_transformers-0.2.0/strands_transformers.egg-info/top_level.txt +1 -0
@@ -0,0 +1,59 @@
name: Docs

on:
  push:
    branches: [main]
    paths:
      - "docs/**"
      - "mkdocs.yml"
      - "pyproject.toml"
      - "strands_transformers/**"
      - ".github/workflows/docs.yml"
  workflow_dispatch:

# Allow one concurrent deployment; let in-progress runs finish.
concurrency:
  group: pages
  cancel-in-progress: false

permissions:
  contents: read
  pages: write
  id-token: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build docs
        run: |
          uv venv --python 3.12
          # Docs build uses griffe's STATIC analysis for the API reference, so
          # we only need the doc toolchain + the package importable as a path —
          # NOT torch / transformers / strands (kept out via --no-deps).
          uv pip install mkdocs-material "mkdocstrings[python]" pymdown-extensions
          uv pip install --no-deps -e .
          # Call the venv binary directly (avoid `uv run`, which would re-sync
          # the full project and pull the heavy ML deps).
          .venv/bin/mkdocs build --strict

      - name: Upload Pages artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: site

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
@@ -0,0 +1,80 @@
name: Release

on:
  push:
    tags:
      - "v*.*.*"

permissions:
  contents: write # create GitHub Release

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history so setuptools-scm sees the tag

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Build sdist + wheel
        run: |
          uv venv --python 3.12
          uv pip install build
          # setuptools-scm derives the version from the git tag (vX.Y.Z → X.Y.Z).
          # `python -m build` is PEP 517-isolated (only needs setuptools-scm to
          # build), so call the venv binary directly — no heavy ML deps pulled.
          .venv/bin/python -m build

      - name: Show built artifacts
        run: ls -l dist/

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/

  pypi:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      # Publish with the PYPI_API_TOKEN repo secret.
      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ secrets.PYPI_API_TOKEN }}

  github-release:
    needs: pypi
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: dist
          path: dist/
      - name: Extract version
        id: v
        run: echo "version=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"
      - name: Create GitHub Release
        uses: softprops/action-gh-release@v2
        with:
          name: strands-transformers v${{ steps.v.outputs.version }}
          generate_release_notes: true
          files: dist/*
          body: |
            ## 🤗 strands-transformers v${{ steps.v.outputs.version }}

            ```bash
            uv pip install strands-transformers==${{ steps.v.outputs.version }}
            # or
            pip install strands-transformers==${{ steps.v.outputs.version }}
            ```

            📖 Docs: https://cagataycali.github.io/strands-transformers/
@@ -0,0 +1,93 @@
# Architecture

`strands-transformers` is the universal entrypoint to HuggingFace transformers for
Strands agents — 100% task & modality coverage with zero hardcoding. It reads
transformers' own task taxonomy at runtime, so new tasks/models work without code
changes (the same philosophy as `use_aws` wrapping boto3 and `use_lerobot`
wrapping lerobot).

## Layout

```
strands_transformers/
├── core/
│   ├── registry.py          # task taxonomy + dynamic class/attr resolution
│   ├── engine.py            # load/cache pipelines & models, device/dtype selection
│   ├── io.py                # multimodal input coercion + JSON-safe output serialization
│   └── compat.py            # backward-compat shims for legacy trust_remote_code models
├── models/
│   └── transformers.py      # TransformerModel — a Strands model provider (local brain)
└── tools/
    └── use_transformers.py  # the single @tool agents call
examples/                    # runnable, GPU-verified examples (see examples/README.md)
```

## Data flow

```
agent → use_transformers(action=...) ─┬─ discovery → registry
                                      ├─ run(task)   → engine.get_pipeline → pipeline(inputs) → io.serialize_output
                                      └─ call(target)→ registry.resolve_attr / cached: → obj(**params) → io.serialize_output
```

### `core/registry.py` — the source of truth
- `supported_tasks()` reads transformers' `SUPPORTED_TASKS` → `{task: {type, auto_models, default_model, pipeline_class}}`.
- `tasks_by_modality()`, `task_info()`, `resolve_task()` (tolerant of underscores/hyphens).
- `auto_model_classes()` lists every `Auto*` entrypoint.
- `resolve_attr(dotted)` resolves any dotted path into transformers (class, fn,
  method), with a root-getattr fast path so transformers' lazy `__getattr__`
  (which raises `AttributeError` on submodule-import attempts) never aborts
  resolution.
- `describe(obj)` introspects signatures/docstrings for the `inspect` action.
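
The dotted-path resolution described for `resolve_attr` can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's actual code: the name `resolve_dotted` is hypothetical, and the demo resolves standard-library paths instead of transformers ones. It tries a plain `getattr` first and falls back to importing a submodule only when the attribute is missing.

```python
import importlib
from typing import Any


def resolve_dotted(dotted: str) -> Any:
    """Resolve 'pkg.sub.Name.attr' to the object it names (hypothetical helper)."""
    root_name, *parts = dotted.split(".")
    obj = importlib.import_module(root_name)
    path = root_name
    for part in parts:
        path = f"{path}.{part}"
        try:
            # Fast path: the attribute is already reachable from its parent.
            obj = getattr(obj, part)
        except AttributeError:
            # Slow path: the name may be a submodule that is not imported yet.
            try:
                obj = importlib.import_module(path)
            except ImportError as exc:
                raise AttributeError(f"cannot resolve {dotted!r} at {part!r}") from exc
    return obj


decoder_cls = resolve_dotted("json.decoder.JSONDecoder")
loads = resolve_dotted("json.loads")
print(decoder_cls.__name__, loads("[1, 2]"))
```

Reporting every failure as `AttributeError` gives the caller a single exception type to handle, whichever branch failed.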

### `core/engine.py` — load, cache, run
- `select_device()` / `select_dtype()` auto-pick cuda/mps/cpu and bf16/fp16.
- `get_pipeline(task, model, ...)` builds & caches a `transformers.pipeline`.
  Image-output tasks (depth-estimation, segmentation, image-to-image,
  mask-generation) are kept in **float32** — half precision breaks PIL/numpy
  post-processing.
- `load_object(auto_class, model_path, ...)` loads any `AutoModel*` / `AutoProcessor`
  / `AutoTokenizer` via `from_pretrained` for the low-level `call` layer.
- `_CACHE` holds pipelines/models/processors keyed by `cache_key` for the session.

### `core/io.py` — multimodal I/O
- **In:** `coerce_input` decodes base64 data-URIs to PIL/bytes; paths/URLs/arrays
  pass through natively. `decode_wav` / `maybe_decode_audio_path` pre-decode WAV
  files for audio tasks with the stdlib `wave` module (no ffmpeg needed).
- **Out:** `serialize_output` converts any result to JSON-safe form — audio dicts
  → `.wav` artifacts, PIL images → `.png` artifacts, torch/numpy tensors → lists
  (bf16/fp16 upcast to float32 first), with `_ensure_json_safe` as a final guard.
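
Decoding WAV with only the standard library, as described above, can be sketched like this. It is a minimal example, not the package's `decode_wav`: the function name `decode_wav_bytes` is hypothetical, and the sketch assumes 16-bit PCM input and mixes channels down to mono.

```python
import io
import struct
import wave
from typing import List, Tuple


def decode_wav_bytes(data: bytes) -> Tuple[List[float], int]:
    """Decode 16-bit PCM WAV bytes to mono floats in [-1, 1] plus the sample rate."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack(f"<{len(frames) // 2}h", frames)
    # Average the interleaved channels down to mono.
    mono = [
        sum(samples[i:i + channels]) / channels / 32768.0
        for i in range(0, len(samples), channels)
    ]
    return mono, rate


# Build a tiny in-memory WAV so the example needs no file on disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))

audio, sample_rate = decode_wav_bytes(buf.getvalue())
print(sample_rate, audio)
```

The resulting list of floats plus a sample rate is the kind of array input an audio pipeline can take without an external decoder such as ffmpeg.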

### `core/compat.py` — legacy model support
Patches transformers 4.x→5.x gaps so old `trust_remote_code` models (e.g. OpenVLA)
run unchanged. Idempotent + re-entrant (`force=True`) because remote code can
re-import transformers mid-load:
- moved tokenizer symbols (`PaddingStrategy`, …) re-exposed on a real
  file-backed `tokenization_utils` module;
- `AutoModelForVision2Seq` recreated as an `AutoModelForImageTextToText` alias,
  asserted everywhere `auto_map` dispatch and `register_for_auto_class()` look;
- `tie_weights()` signature drift made kwarg-tolerant via an `init_weights` wrap;
- broken-torchcodec detection disabled so audio pipelines take the array path;
- `spoof_timm_version()` for models with hard timm pins.
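
The "idempotent + re-entrant" behaviour can be illustrated with a small sketch. Everything here is hypothetical: `apply_compat`, `OldName` and `NewName` are placeholder names, and the patch target is a throwaway module object rather than transformers. The point is only the control flow: repeated calls are no-ops, while `force=True` re-applies the alias after something has removed it.

```python
import types

_APPLIED = False


def apply_compat(module: types.ModuleType, force: bool = False) -> bool:
    """Alias a removed name onto `module`; return True if patching ran."""
    global _APPLIED
    if _APPLIED and not force:
        return False  # idempotent: repeated calls are no-ops
    # Only add the alias when the old name is actually missing.
    if not hasattr(module, "OldName") and hasattr(module, "NewName"):
        module.OldName = module.NewName
    _APPLIED = True
    return True


# Stand-in module; a real shim would target the library being patched.
fake_lib = types.ModuleType("fake_lib")
fake_lib.NewName = type("NewName", (), {})

ran_first = apply_compat(fake_lib)
ran_again = apply_compat(fake_lib)

# Simulate remote code re-importing the library and losing the alias.
del fake_lib.OldName
ran_forced = apply_compat(fake_lib, force=True)
print(ran_first, ran_again, ran_forced, fake_lib.OldName is fake_lib.NewName)
```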

### `tools/use_transformers.py` — the one tool
Two layers + discovery:
- **run** — high-level pipelines (native multimodal). Folds separate images into
  chat content for `image-text-to-text`; pre-decodes WAV for audio tasks.
- **call** — dynamic dispatch to any class/fn/method. `cached:key[.attr]` refs
  resolve to live cached objects (including inside `parameters`); a `"**"` param
  key unpacks a cached mapping into kwargs (e.g. `model.predict_action(**batch)`).
- **discovery** — `tasks`, `modalities`, `task_info`, `classes`, `inspect`,
  `cache`, `clear_cache`, `compat`.
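
How `cached:key[.attr]` references and the `"**"` parameter key might be resolved is sketched below. This is a guess at the mechanism, not the tool's implementation: `resolve_ref`, `build_kwargs`, the `CACHE` dictionary and `FakeModel` are all hypothetical, and the sketch assumes cache keys contain no dots.

```python
from typing import Any, Dict

CACHE: Dict[str, Any] = {}


def resolve_ref(value: Any) -> Any:
    """Turn 'cached:key[.attr...]' strings into live objects; pass others through."""
    if not (isinstance(value, str) and value.startswith("cached:")):
        return value
    key, *attrs = value[len("cached:"):].split(".")
    obj = CACHE[key]
    for attr in attrs:
        obj = getattr(obj, attr)
    return obj


def build_kwargs(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve refs inside parameters and expand a '**' entry into kwargs."""
    kwargs: Dict[str, Any] = {}
    for name, value in parameters.items():
        resolved = resolve_ref(value)
        if name == "**":
            kwargs.update(resolved)  # cached mapping becomes keyword arguments
        else:
            kwargs[name] = resolved
    return kwargs


class FakeModel:
    """Stand-in for a model exposing a predict_action-style method."""

    def predict_action(self, image: str, instruction: str, note: str = "") -> str:
        return f"{instruction} on {image} ({note})"


CACHE["model"] = FakeModel()
CACHE["batch"] = {"image": "frame0", "instruction": "pick up the cube"}

target = resolve_ref("cached:model.predict_action")
result = target(**build_kwargs({"**": "cached:batch", "note": "demo"}))
print(result)
```

Because references are plain strings, an agent can chain steps (load a model, build a batch, call a method) using only JSON-serializable tool arguments.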

### `models/transformers.py` — local brain
`TransformerModel` is a Strands model provider running any local HF causal-LM as
the agent's reasoning engine (streaming, chat templates, Qwen3 `<think>`, XML
tool-calling). Pair it with `use_transformers` for a fully local multimodal agent.

## Testing philosophy

Every change is verified **end-to-end against the real implementation** — actual
model inference / pipelines, not mocks. `examples/smoke.py` is a fast (no large
downloads) 12-check gate across discovery + text/image/audio that exits non-zero
on failure.
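
The shape of such a gate can be sketched in a few lines. This is not the contents of `examples/smoke.py`; the `run_checks` helper and both checks are hypothetical, chosen only to show how a script can run every check and still report failure through its exit code.

```python
from typing import Callable, List, Tuple


def run_checks(checks: List[Tuple[str, Callable[[], None]]]) -> int:
    """Run every check, print a summary, and return a process exit code."""
    passed = 0
    for name, check in checks:
        try:
            check()
        except Exception as exc:  # a failing check must not stop the others
            print(f"FAIL {name}: {exc}")
        else:
            passed += 1
            print(f"ok   {name}")
    print(f"{passed}/{len(checks)} checks passed")
    return 0 if passed == len(checks) else 1


def check_ok() -> None:
    assert 1 + 1 == 2


def check_broken() -> None:
    raise RuntimeError("model output was empty")


exit_code = run_checks([("arithmetic", check_ok), ("broken", check_broken)])
# A real script would finish with: sys.exit(exit_code)
```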
@@ -0,0 +1,252 @@
Metadata-Version: 2.4
Name: strands-transformers
Version: 0.2.0
Summary: The universal entrypoint to HuggingFace transformers for Strands agents — 100% task & modality coverage, zero hardcoding.
Author-email: Cagatay Cali <cagataycali@icloud.com>
License: MIT
Project-URL: Homepage, https://github.com/cagataycali/strands-transformers
Project-URL: Repository, https://github.com/cagataycali/strands-transformers
Project-URL: Issues, https://github.com/cagataycali/strands-transformers/issues
Keywords: strands,transformers,huggingface,ai,agents,multimodal,vision,audio,video,vla,robotics,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: strands-agents
Requires-Dist: transformers>=4.40
Requires-Dist: torch
Requires-Dist: pillow
Requires-Dist: numpy
Provides-Extra: audio
Requires-Dist: soundfile; extra == "audio"
Requires-Dist: librosa; extra == "audio"
Provides-Extra: vision
Requires-Dist: pillow; extra == "vision"
Requires-Dist: opencv-python; extra == "vision"
Requires-Dist: av; extra == "vision"
Provides-Extra: training
Requires-Dist: trl; extra == "training"
Requires-Dist: peft; extra == "training"
Requires-Dist: accelerate; extra == "training"
Requires-Dist: datasets; extra == "training"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings[python]; extra == "docs"
Requires-Dist: pymdown-extensions; extra == "docs"
Provides-Extra: all
Requires-Dist: strands-transformers[audio,dev,docs,training,vision]; extra == "all"

<div align="center">
  <h1>🤗 Strands Transformers</h1>
  <h3>One tool wraps <i>all</i> of HuggingFace transformers. One provider makes any local model a multimodal agent brain.</h3>
  <p><b>Agents that see, hear, and speak — 100% task coverage, zero hardcoding, fully local.</b></p>

  <div>
    <a href="https://pypi.org/project/strands-transformers/"><img alt="pypi" src="https://img.shields.io/pypi/v/strands-transformers"/></a>
    <a href="https://github.com/cagataycali/strands-transformers/actions/workflows/docs.yml"><img alt="docs" src="https://github.com/cagataycali/strands-transformers/actions/workflows/docs.yml/badge.svg"/></a>
    <a href="https://github.com/cagataycali/strands-transformers/issues"><img alt="issues" src="https://img.shields.io/github/issues/cagataycali/strands-transformers"/></a>
    <img alt="python" src="https://img.shields.io/badge/python-3.10+-blue"/>
    <img alt="transformers" src="https://img.shields.io/badge/🤗_transformers-24_tasks-yellow"/>
    <img alt="modalities" src="https://img.shields.io/badge/modalities-text·image·video·audio-orange"/>
    <img alt="license" src="https://img.shields.io/badge/license-MIT-green"/>
  </div>
</div>

---

`use_aws` wraps all of boto3. `use_lerobot` wraps all of lerobot.
**`use_transformers` wraps all of HuggingFace transformers** — every task, every
modality, in one tool that reads transformers' own taxonomy at runtime (new task
upstream ⇒ supported here with **no code change**). And **`TransformerModel`** makes
any **local** HF model a drop-in Strands brain that speaks the full content-block
protocol — image, video, audio, document. With Qwen2.5-Omni it even **speaks back**.

```mermaid
flowchart LR
    IN["📥 text · image · video<br/>audio · document · robot-state"]
    TOOL["🛠️ use_transformers<br/><i>tool</i>"]
    BRAIN["🧠 TransformerModel<br/><i>local agent brain</i>"]
    OUT["📤 text · speech · image<br/>labels · actions"]
    IN --> TOOL --> OUT
    IN --> BRAIN --> OUT
    classDef i fill:#7C4DFF,stroke:#5b34d6,color:#fff;
    classDef c fill:#FFD21E,stroke:#E68A00,color:#3a2d00;
    classDef o fill:#00E5FF,stroke:#00b3cc,color:#003844;
    class IN i;
    class TOOL,BRAIN c;
    class OUT o;
```

📖 **[Full documentation →](https://cagataycali.github.io/strands-transformers/)** · built with MkDocs (`docs/`)

## Install

```bash
uv pip install strands-transformers          # from PyPI
# or from source:
uv pip install -e .                          # or: pip install -e .
PYTHONPATH=. python examples/smoke.py        # verify → "12/12 checks passed"
```

<details>
<summary>Optional extras (audio · vision · training · docs)</summary>

```bash
uv pip install -e ".[audio]"      # soundfile, librosa (mp3/flac/ogg decode)
uv pip install -e ".[vision]"     # opencv, av (video)
uv pip install -e ".[training]"   # trl, peft, accelerate
uv pip install -e ".[docs]"       # mkdocs-material, mkdocstrings
uv pip install -e ".[all]"        # everything
```
WAV audio works without extras. `device="auto"` picks cuda → mps → cpu (bf16 on GPU).
</details>

## 60-second hello — a local vision agent

```python
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

buf = io.BytesIO(); Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")  # green square

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.
```

A 256M-param model in the standard Strands loop, *seeing* pixels through a content
block — no API key, no server. Swap `model_path` for any HF VLM.
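
For longer scripts it can help to wrap the content-block dictionaries from the hello example in small helpers. The sketch below is only a convenience layer: `image_block`, `text_block` and `build_prompt` are hypothetical names, not part of `strands_transformers`, and they reproduce nothing beyond the dictionary shape shown in the example above.

```python
from typing import Any, Dict, List


def image_block(data: bytes, fmt: str = "png") -> Dict[str, Any]:
    """Wrap raw image bytes in the content-block shape used in the hello example."""
    return {"image": {"format": fmt, "source": {"bytes": data}}}


def text_block(text: str) -> Dict[str, Any]:
    """Wrap a plain string in a text content block."""
    return {"text": text}


def build_prompt(image_bytes: bytes, question: str) -> List[Dict[str, Any]]:
    """Image first, then the question, mirroring the hello example."""
    return [image_block(image_bytes), text_block(question)]


# Placeholder bytes; in practice read them from a real image file.
prompt = build_prompt(b"\x89PNG...", "Color? One word.")
print([list(block) for block in prompt])
# The list can then be passed to the agent: agent(prompt)
```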

## See it work

Every output below is a **real** model result (CUDA · transformers 5.12 · torch 2.10):

| You give it | Script | It returns |
|-------------|--------|-----------|
| 🖼️ a green image + "Color?" | `examples/multimodal_agent.py` | `"Green."` |
| 🎬 brightening frames | `examples/multimodal_advanced.py` | `"BRIGHTER."` |
| 🧰 a tool screenshot (blue) | `examples/multimodal_advanced.py` | `"Blue."` |
| 📄 a text document | `examples/document_and_audio.py` | recovers `BANANA-42` |
| 🔊 a 440 Hz tone (Omni) | `examples/omni_audio.py` | `"It's a pure tone."` |
| 💬 "say: …can speak" (Omni) | `examples/omni_audio.py` | 🔊 real 24 kHz speech |

▶️ **[Hear Omni speak + see all diagrams in the docs →](https://cagataycali.github.io/strands-transformers/)**

## Two ways to use it

<details open>
<summary><b>As a tool</b> — <code>use_transformers</code> (discover · run · call)</summary>

```python
from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                      # image-text-to-text
agent("Say 'hello from strands' as audio")         # text-to-audio
agent("Detect objects in https://.../street.jpg")  # object-detection
```

Discover everything at runtime (`action="tasks" | "modalities" | "inspect" | …`),
run high-level pipelines, or `call` any class/fn/method for custom models.
→ **[The tool guide](https://cagataycali.github.io/strands-transformers/guide/the-tool/)**
</details>

<details>
<summary><b>As the agent's brain</b> — <code>TransformerModel</code> (multimodal content blocks)</summary>

Pass `image` / `video` / `audio` / `document` content blocks (and media inside a
`toolResult`) — the provider auto-detects the model's processor and routes them.
All outputs below are **real** results (CUDA, transformers 5.12 / torch 2.10):

| Content block | Example | Verified output |
|---|---|---|
| `image` | `multimodal_agent.py` | `"Green."` |
| `video` (with `fps`) | `multimodal_advanced.py` | `"BRIGHTER."` |
| `image` in `toolResult` | `multimodal_advanced.py` | `"Blue."` |
| `document` | `document_and_audio.py` | recovers `BANANA-42` |
| `audio` *(our schema extension)* | `audio_content_block.py` | audio → text |
| `audio` in **and** speech out | `omni_audio.py` | hears + **speaks** (Qwen2.5-Omni) |

→ **[Agent brain](https://cagataycali.github.io/strands-transformers/guide/agent-brain/)** ·
**[Content blocks](https://cagataycali.github.io/strands-transformers/guide/content-blocks/)** ·
**[Audio](https://cagataycali.github.io/strands-transformers/guide/audio/)**
</details>

<details>
<summary><b>Robotics / VLA</b> — camera + instruction → robot actions</summary>

Two layers, both transformers-native and GPU-verified:
- 🧠 **reason** — [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
  (a physical-AI VLM) plans over a scene via the `run` path: *"the red cube is in
  the bottom left corner, so the arm should move there first."*
- ⚙️ **act** — VLA models expose `predict_action` via the `call` path:
  [MolmoAct2](https://huggingface.co/allenai/MolmoAct2-SO100_101) → `[1,30,6]`;
  [OpenVLA-7b](https://huggingface.co/openvla/openvla-7b) → 7-DoF (auto 4.x→5.x shims).

🔗 **Full agentic loop** ([`examples/robot_reason_act_agent.py`](examples/robot_reason_act_agent.py)):
Cosmos-Reason *plans* over real RealSense frames → MolmoAct *acts* (`[1,30,6]`) —
perception→plan→action through one tool.

Lerobot-ecosystem policies (SmolVLA, π0, ACT, GR00T) use their own runtimes —
pair with `use_lerobot`.
→ **[Robotics guide](https://cagataycali.github.io/strands-transformers/guide/robotics/)**
</details>

## How it works

Nothing is hardcoded per task — `core/registry.py` reads transformers' own
`SUPPORTED_TASKS` at runtime, so coverage tracks upstream automatically.
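
A registry driven by a task table, rather than per-task code, might be sketched as below. The table is a small stand-in invented for this example (the real registry reads transformers' own data), and the function bodies are assumptions about behaviour, not the package's code. Only the underscore/hyphen tolerance and grouping by modality come from the project's description.

```python
from typing import Dict, List

# Stand-in for the task table the real registry reads from transformers.
# These entries are illustrative only, not the real data.
FAKE_SUPPORTED_TASKS: Dict[str, Dict[str, str]] = {
    "text-classification": {"type": "text"},
    "image-text-to-text": {"type": "multimodal"},
    "depth-estimation": {"type": "image"},
}


def resolve_task(name: str, table: Dict[str, Dict[str, str]]) -> str:
    """Match a task name, treating underscores and hyphens as equivalent."""
    wanted = name.strip().lower().replace("_", "-")
    for task in table:
        if task.lower().replace("_", "-") == wanted:
            return task
    raise KeyError(f"unknown task: {name!r}")


def tasks_by_modality(table: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Group task names by their declared type."""
    grouped: Dict[str, List[str]] = {}
    for task, info in table.items():
        grouped.setdefault(info["type"], []).append(task)
    return grouped


print(resolve_task("image_text_to_text", FAKE_SUPPORTED_TASKS))
print(tasks_by_modality(FAKE_SUPPORTED_TASKS))
```

Since both functions only iterate over the table they are given, a new entry upstream would be picked up without any change to this code.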

<details>
<summary>Project layout</summary>

```
strands_transformers/
├── tools/use_transformers.py            # the one @tool: discover · run · call
├── models/transformers.py               # TransformerModel — local multimodal agent brain
├── types/audio.py                       # audio content-block extension
└── core/{registry,engine,io,compat}.py  # taxonomy · load/cache · I/O · legacy shims
```
→ **[Architecture](https://cagataycali.github.io/strands-transformers/reference/architecture/)** ·
**[API reference](https://cagataycali.github.io/strands-transformers/reference/transformer-model/)**
</details>

## Examples

12 runnable, GPU-verified examples in [`examples/`](examples/) — image, video,
audio, document, Omni speech, VLA, and pipelines. Run any:

```bash
PYTHONPATH=. python examples/<name>.py
```

→ **[Examples & FAQ](https://cagataycali.github.io/strands-transformers/reference/examples/)**

## License

MIT — built with [Strands Agents SDK](https://github.com/strands-agents/sdk-python)
and [HuggingFace Transformers](https://github.com/huggingface/transformers).

<div align="center">
  <sub>If this saved you a pile of per-model glue code, consider giving it a ⭐</sub>
</div>