PyPI - trinity-rft - Versions diffs - 0.2.0__tar.gz → 0.2.1.dev0__tar.gz - Mend

trinity-rft 0.2.0__tar.gz → 0.2.1.dev0__tar.gz

This diff shows the content of publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.

Files changed (163)

{trinity_rft-0.2.0 → trinity_rft-0.2.1.dev0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: trinity-rft
-Version: 0.2.0
+Version: 0.2.1.dev0
 Summary: Trinity-RFT: A Framework for Training Large Language Models with Reinforcement Fine-Tuning
 Author-email: Trinity-RFT Team <trinity-rft@outlook.com>
 Project-URL: Homepage, https://github.com/modelscope/Trinity-RFT
@@ -15,9 +15,9 @@ Classifier: Programming Language :: Python :: 3.12
 Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: verl==0.4.0
+Requires-Dist: verl==0.4.1
 Requires-Dist: ray[default]>=2.45.0
-Requires-Dist: vllm==0.9.1
+Requires-Dist: vllm<=0.10.0,>=0.9.1
 Requires-Dist: tensordict==0.6.2
 Requires-Dist: wandb
 Requires-Dist: omegaconf
@@ -35,6 +35,8 @@ Requires-Dist: tensorboard
 Requires-Dist: openai
 Requires-Dist: jsonlines
 Requires-Dist: sortedcontainers
+Requires-Dist: word2number
+Requires-Dist: transformers<4.54.0
 Provides-Extra: data
 Requires-Dist: py-data-juicer; extra == "data"
 Provides-Extra: agent
@@ -57,10 +59,13 @@ Requires-Dist: sphinx; extra == "doc"
 Requires-Dist: sphinx-autobuild; extra == "doc"
 Requires-Dist: sphinx_rtd_theme; extra == "doc"
 Requires-Dist: myst-parser; extra == "doc"
+Requires-Dist: sphinxcontrib-apidoc; extra == "doc"
+Requires-Dist: sphinx-multiversion; extra == "doc"
 Provides-Extra: flash-attn
 Requires-Dist: flash-attn==2.8.0.post2; extra == "flash-attn"
 Dynamic: license-file
+[**中文主页**](https://github.com/modelscope/Trinity-RFT/blob/main/README_zh.md) | [**Tutorial**](https://modelscope.github.io/Trinity-RFT/) | [**FAQ**](./docs/sphinx_doc/source/tutorial/faq.md)
 <div align="center">
   <img src="https://img.alicdn.com/imgextra/i1/O1CN01lvLpfw25Pl4ohGZnU_!!6000000007519-2-tps-1628-490.png" alt="Trinity-RFT" style="height: 120px;">
@@ -84,6 +89,9 @@ Dynamic: license-file
 ## 🚀 News
+* [2025-08] ✨ Trinity-RFT v0.2.1 is released with enhanced features for Agentic RL and Async RL.
+* [2025-08] 🎵 We introduce [CHORD](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord), a dynamic integration of SFT and RL for enhanced LLM fine-tuning ([paper](https://arxiv.org/pdf/2508.11408)).
+* [2025-08] We now support training on general multi-step workflows! Please check out examples for [ALFWorld](./docs/sphinx_doc/source/tutorial/example_step_wise.md) and [ReAct](./docs/sphinx_doc/source/tutorial/example_react.md).
 * [2025-07] Trinity-RFT v0.2.0 is released.
 * [2025-07] We update the [technical report](https://arxiv.org/abs/2505.17826) (arXiv v2) with new features, examples, and experiments.
 * [2025-06] Trinity-RFT v0.1.1 is released.
@@ -143,6 +151,15 @@ It is designed to support diverse application scenarios and serve as a unified p
   <img src="https://img.alicdn.com/imgextra/i3/O1CN01E7NskS1FFoTI9jlaQ_!!6000000000458-2-tps-1458-682.png" alt="Trinity-RFT-modes">
 </p>
+</details>
+<details>
+<summary>Figure: Concatenated and general multi-step workflows</summary>
+<p align="center">
+  <img src="https://img.alicdn.com/imgextra/i1/O1CN01z1i7kk1jlMEVa8ZHV_!!6000000004588-2-tps-1262-695.png" alt="Trinity-RFT-multi-step">
+</p>
 </details>
@@ -214,6 +231,12 @@ It is designed to support diverse application scenarios and serve as a unified p
 ### Step 1: installation
+Requirements:
+- Python version >= 3.10, <= 3.12
+- CUDA version >= 12.4, <= 12.8
+- At least 2 GPUs
 Installation from source **(recommended)**:
 ```shell
@@ -243,13 +266,15 @@ pip install -e .[flash_attn]
 # for zsh
 pip install -e .\[flash_attn\]
 # Try the following command if you encounter errors during flash-attn installation
-# pip install flash-attn -v --no-build-isolation
+# pip install flash-attn==2.8.0.post2 -v --no-build-isolation
 ```
 Installation using pip:
 ```shell
 pip install trinity-rft==0.2.0
+# install flash-attn separately
+pip install flash-attn==2.8.0.post2
 ```
 Installation from docker:
@@ -268,13 +293,6 @@ docker build -f scripts/docker/Dockerfile -t trinity-rft:latest .
 docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
 ```
-**Requirements:**
-Python version >= 3.10,
-CUDA version >= 12.4,
-and at least 2 GPUs.
 ### Step 2: prepare dataset and model
@@ -291,7 +309,7 @@ huggingface-cli download {model_name} --local-dir $MODEL_PATH/{model_name}
 modelscope download {model_name} --local_dir $MODEL_PATH/{model_name}
 ```
-For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or  [ModelScope](https://modelscope.cn/docs/models/download).
+For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or [ModelScope](https://modelscope.cn/docs/models/download).
@@ -318,13 +336,13 @@ Trinity-RFT provides a web interface for configuring your RFT process.
 > This is an experimental feature, and we will continue to improve it.
-To enable minimal features (mainly for trainer), you can run
+To launch the web interface for minimal configurations, you can run
 ```bash
 trinity studio --port 8080
 ```
-Then you can configure your RFT process in the web page and generate a config file. You can save the config for later use or run it directly as described in the following section.
+Then you can configure your RFT process in the web page and generate a config file. You can save the config file for later use or run it directly as described in the following section.
 Advanced users can also edit the config file directly.
 We provide example config files in [`examples`](examples/).
@@ -392,7 +410,12 @@ Tutorials for running different RFT modes:
 Tutorials for adapting Trinity-RFT to a new multi-turn agentic scenario:
-+ [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
++ [Concatenated Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
+Tutorials for adapting Trinity-RFT to a general multi-step agentic scenario:
++ [General Multi-Step tasks](./docs/sphinx_doc/source/tutorial/example_step_wise.md)
++ [ReAct agent tasks](./docs/sphinx_doc/source/tutorial/example_react.md)
 Tutorials for data-related functionalities:
@@ -416,10 +439,6 @@ Guidelines for developers and researchers:
-For some frequently asked questions, see [FAQ](./docs/sphinx_doc/source/tutorial/faq.md).
 ## Upcoming features

{trinity_rft-0.2.0 → trinity_rft-0.2.1.dev0}/README.md RENAMED

@@ -1,3 +1,4 @@
+[**中文主页**](https://github.com/modelscope/Trinity-RFT/blob/main/README_zh.md) | [**Tutorial**](https://modelscope.github.io/Trinity-RFT/) | [**FAQ**](./docs/sphinx_doc/source/tutorial/faq.md)
 <div align="center">
   <img src="https://img.alicdn.com/imgextra/i1/O1CN01lvLpfw25Pl4ohGZnU_!!6000000007519-2-tps-1628-490.png" alt="Trinity-RFT" style="height: 120px;">
@@ -21,6 +22,9 @@
 ## 🚀 News
+* [2025-08] ✨ Trinity-RFT v0.2.1 is released with enhanced features for Agentic RL and Async RL.
+* [2025-08] 🎵 We introduce [CHORD](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_chord), a dynamic integration of SFT and RL for enhanced LLM fine-tuning ([paper](https://arxiv.org/pdf/2508.11408)).
+* [2025-08] We now support training on general multi-step workflows! Please check out examples for [ALFWorld](./docs/sphinx_doc/source/tutorial/example_step_wise.md) and [ReAct](./docs/sphinx_doc/source/tutorial/example_react.md).
 * [2025-07] Trinity-RFT v0.2.0 is released.
 * [2025-07] We update the [technical report](https://arxiv.org/abs/2505.17826) (arXiv v2) with new features, examples, and experiments.
 * [2025-06] Trinity-RFT v0.1.1 is released.
@@ -80,6 +84,15 @@ It is designed to support diverse application scenarios and serve as a unified p
   <img src="https://img.alicdn.com/imgextra/i3/O1CN01E7NskS1FFoTI9jlaQ_!!6000000000458-2-tps-1458-682.png" alt="Trinity-RFT-modes">
 </p>
+</details>
+<details>
+<summary>Figure: Concatenated and general multi-step workflows</summary>
+<p align="center">
+  <img src="https://img.alicdn.com/imgextra/i1/O1CN01z1i7kk1jlMEVa8ZHV_!!6000000004588-2-tps-1262-695.png" alt="Trinity-RFT-multi-step">
+</p>
 </details>
@@ -151,6 +164,12 @@ It is designed to support diverse application scenarios and serve as a unified p
 ### Step 1: installation
+Requirements:
+- Python version >= 3.10, <= 3.12
+- CUDA version >= 12.4, <= 12.8
+- At least 2 GPUs
 Installation from source **(recommended)**:
 ```shell
@@ -180,13 +199,15 @@ pip install -e .[flash_attn]
 # for zsh
 pip install -e .\[flash_attn\]
 # Try the following command if you encounter errors during flash-attn installation
-# pip install flash-attn -v --no-build-isolation
+# pip install flash-attn==2.8.0.post2 -v --no-build-isolation
 ```
 Installation using pip:
 ```shell
 pip install trinity-rft==0.2.0
+# install flash-attn separately
+pip install flash-attn==2.8.0.post2
 ```
 Installation from docker:
@@ -205,13 +226,6 @@ docker build -f scripts/docker/Dockerfile -t trinity-rft:latest .
 docker run -it --gpus all --shm-size="64g" --rm -v $PWD:/workspace -v <root_path_of_data_and_checkpoints>:/data trinity-rft:latest
 ```
-**Requirements:**
-Python version >= 3.10,
-CUDA version >= 12.4,
-and at least 2 GPUs.
 ### Step 2: prepare dataset and model
@@ -228,7 +242,7 @@ huggingface-cli download {model_name} --local-dir $MODEL_PATH/{model_name}
 modelscope download {model_name} --local_dir $MODEL_PATH/{model_name}
 ```
-For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or  [ModelScope](https://modelscope.cn/docs/models/download).
+For more details about model downloading, see [Huggingface](https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or [ModelScope](https://modelscope.cn/docs/models/download).
@@ -255,13 +269,13 @@ Trinity-RFT provides a web interface for configuring your RFT process.
 > This is an experimental feature, and we will continue to improve it.
-To enable minimal features (mainly for trainer), you can run
+To launch the web interface for minimal configurations, you can run
 ```bash
 trinity studio --port 8080
 ```
-Then you can configure your RFT process in the web page and generate a config file. You can save the config for later use or run it directly as described in the following section.
+Then you can configure your RFT process in the web page and generate a config file. You can save the config file for later use or run it directly as described in the following section.
 Advanced users can also edit the config file directly.
 We provide example config files in [`examples`](examples/).
@@ -329,7 +343,12 @@ Tutorials for running different RFT modes:
 Tutorials for adapting Trinity-RFT to a new multi-turn agentic scenario:
-+ [Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
++ [Concatenated Multi-turn tasks](./docs/sphinx_doc/source/tutorial/example_multi_turn.md)
+Tutorials for adapting Trinity-RFT to a general multi-step agentic scenario:
++ [General Multi-Step tasks](./docs/sphinx_doc/source/tutorial/example_step_wise.md)
++ [ReAct agent tasks](./docs/sphinx_doc/source/tutorial/example_react.md)
 Tutorials for data-related functionalities:
@@ -353,10 +372,6 @@ Guidelines for developers and researchers:
-For some frequently asked questions, see [FAQ](./docs/sphinx_doc/source/tutorial/faq.md).
 ## Upcoming features

{trinity_rft-0.2.0 → trinity_rft-0.2.1.dev0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name =  "trinity-rft"
-version = "0.2.0"
+version = "0.2.1.dev0"
 authors = [
     {name="Trinity-RFT Team", email="trinity-rft@outlook.com"},
 ]
@@ -21,9 +21,9 @@ classifiers = [
 ]
 requires-python = ">=3.10"
 dependencies = [
-    "verl==0.4.0",
+    "verl==0.4.1",
     "ray[default]>=2.45.0",
-    "vllm==0.9.1",
+    "vllm>=0.9.1,<=0.10.0",
     "tensordict==0.6.2",
     "wandb",
     "omegaconf",
@@ -41,6 +41,8 @@ dependencies = [
     "openai",
     "jsonlines",
     "sortedcontainers",
+    "word2number",
+    "transformers<4.54.0",  # TODO: remove when https://github.com/vllm-project/vllm-ascend/issues/2046 is fixed
 ]
 [project.scripts]
@@ -66,7 +68,7 @@ dev = [
     "pytest>=8.0.0",
     "pytest-json-ctrf",
     "parameterized",
-    "matplotlib"
+    "matplotlib",
 ]
 doc = [
@@ -74,6 +76,8 @@ doc = [
     "sphinx-autobuild",
     "sphinx_rtd_theme",
     "myst-parser",
+    "sphinxcontrib-apidoc",
+    "sphinx-multiversion",
 ]
 flash_attn = [
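
Note on the new `vllm>=0.9.1,<=0.10.0` range: version components compare numerically, so `0.10.0` is newer than `0.9.1` even though it sorts earlier as a string. The sketch below illustrates this with a hypothetical `parse_release` helper that handles plain dotted releases only. It is not a full PEP 440 implementation and is not part of the package.

```python
def parse_release(version):
    """Turn '0.10.0' into (0, 10, 0). Hypothetical helper: plain dotted releases only."""
    return tuple(int(part) for part in version.split("."))


def in_vllm_range(version, low="0.9.1", high="0.10.0"):
    """Check low <= version <= high by comparing numeric components, not strings."""
    return parse_release(low) <= parse_release(version) <= parse_release(high)


if __name__ == "__main__":
    for candidate in ["0.9.0", "0.9.1", "0.9.2", "0.10.0", "0.10.1"]:
        print(candidate, in_vllm_range(candidate))
```

A naive string comparison would place `"0.10.0"` before `"0.9.1"`, which is why real tooling parses versions before comparing them.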

{trinity_rft-0.2.0 → trinity_rft-0.2.1.dev0}/trinity/__init__.py RENAMED

@@ -1,4 +1,4 @@
 # -*- coding: utf-8 -*-
 """Trinity-RFT (Reinforcement Fine-Tuning)"""
-__version__ = "0.2.0"
+__version__ = "0.2.1"

{trinity_rft-0.2.0 → trinity_rft-0.2.1.dev0}/trinity/algorithm/__init__.py RENAMED

@@ -1,3 +1,4 @@
+from trinity.algorithm.add_strategy import ADD_STRATEGY, AddStrategy
 from trinity.algorithm.advantage_fn import ADVANTAGE_FN, AdvantageFn
 from trinity.algorithm.algorithm import ALGORITHM_TYPE, AlgorithmType
 from trinity.algorithm.entropy_loss_fn import ENTROPY_LOSS_FN, EntropyLossFn
@@ -18,4 +19,6 @@ __all__ = [
     "ENTROPY_LOSS_FN",
     "SampleStrategy",
     "SAMPLE_STRATEGY",
+    "AddStrategy",
+    "ADD_STRATEGY",
 ]

trinity_rft-0.2.1.dev0/trinity/algorithm/add_strategy/__init__.py ADDED

@@ -0,0 +1,25 @@
+from trinity.algorithm.add_strategy.add_strategy import (
+    ADD_STRATEGY,
+    AddStrategy,
+    GRPOAddStrategy,
+    OPMDAddStrategy,
+    RewardVarianceAddStrategy,
+)
+from trinity.algorithm.add_strategy.correct_bias_add_strategy import (
+    CorrectBiasAddStrategy,
+)
+from trinity.algorithm.add_strategy.duplicate_add_strategy import (
+    DuplicateInformativeAddStrategy,
+)
+from trinity.algorithm.add_strategy.step_wise_add_strategy import StepWiseGRPOStrategy
+__all__ = [
+    "ADD_STRATEGY",
+    "AddStrategy",
+    "GRPOAddStrategy",
+    "OPMDAddStrategy",
+    "StepWiseGRPOStrategy",
+    "RewardVarianceAddStrategy",
+    "CorrectBiasAddStrategy",
+    "DuplicateInformativeAddStrategy",
+]
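
The new package exports an `ADD_STRATEGY` registry alongside the strategy classes, which suggests strategies are looked up by a string name. The internals of `trinity.utils.registry.Registry` are not part of this diff, so the following is only a minimal sketch of the decorator-based registry pattern. `MiniRegistry`, its `get` method, and `KeepAll` are hypothetical names used for illustration.

```python
class MiniRegistry:
    """Hypothetical stand-in for a name -> class registry (not Trinity's Registry)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, module_name):
        """Return a class decorator that stores the class under module_name."""

        def decorator(cls):
            if module_name in self._modules:
                raise KeyError(f"{module_name} is already registered in {self.name}")
            self._modules[module_name] = cls
            return cls

        return decorator

    def get(self, module_name):
        """Return the registered class, or None if the name is unknown."""
        return self._modules.get(module_name)


STRATEGIES = MiniRegistry("add_strategy")


@STRATEGIES.register_module("keep_all")
class KeepAll:
    """Toy strategy used only to show the lookup."""

    @classmethod
    def default_args(cls):
        return {}


if __name__ == "__main__":
    strategy_cls = STRATEGIES.get("keep_all")
    print(strategy_cls.__name__, strategy_cls.default_args())
```

With this pattern a config file can name a strategy as a plain string and the framework can resolve it to a class at runtime.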

trinity_rft-0.2.1.dev0/trinity/algorithm/add_strategy/add_strategy.py ADDED

@@ -0,0 +1,230 @@
+import asyncio
+from abc import ABC, abstractmethod
+from typing import Dict, List, Literal, Tuple
+import numpy as np
+import torch
+from trinity.buffer import BufferWriter
+from trinity.common.experience import Experience
+from trinity.utils.monitor import gather_metrics
+from trinity.utils.registry import Registry
+from trinity.utils.timer import Timer
+ADD_STRATEGY = Registry("add_strategy")
+class AddStrategy(ABC):
+    def __init__(self, writer: BufferWriter, **kwargs) -> None:
+        self.writer = writer
+    @abstractmethod
+    async def add(self, experiences: List[Experience], step: int) -> Tuple[int, Dict]:
+        """Add experiences to the buffer.
+        Args:
+            experiences (`Experience`): The experiences to be added.
+            step (`int`): The current step number.
+        Returns:
+            `int`: The number of experiences added to the buffer.
+            `Dict`: Metrics for logging.
+        """
+    @classmethod
+    @abstractmethod
+    def default_args(cls) -> dict:
+        """Get the default arguments of the add strategy.
+        Returns:
+            `dict`: The default arguments.
+        """
+class GroupAdvantageStrategy(AddStrategy):
+    """An example AddStrategy that calculates group advantages."""
+    @abstractmethod
+    def group_experiences(self, exps: List[Experience]) -> Dict[str, List[Experience]]:
+        """Group experiences by a certain criterion.
+        Args:
+            exps (List[Experience]): List of experiences to be grouped.
+        Returns:
+            Dict[str, List[Experience]]: A dictionary where keys are group identifiers and values are lists of experiences.
+        """
+    @abstractmethod
+    def calculate_group_advantage(
+        self, group_id: str, exps: List[Experience]
+    ) -> Tuple[List[Experience], Dict]:
+        """Calculate advantages for a group of experiences.
+        Args:
+            group_id (str): The identifier for the group of experiences.
+            exps (List[Experience]): List of experiences in the group.
+        Returns:
+            Tuple[List[Experience], Dict]: A tuple containing the modified list of experiences and a dictionary of metrics.
+        """
+    async def add(self, exps: List[Experience], step: int) -> Tuple[int, Dict]:
+        if len(exps) == 0:
+            return 0, {}
+        exp_groups = self.group_experiences(exps)
+        cnt = 0
+        metric_list = []
+        tasks = []
+        for group_id, group_exps in exp_groups.items():
+            group_exps, group_metrics = self.calculate_group_advantage(group_id, group_exps)
+            metric_list.append(group_metrics)
+            cnt += len(group_exps)
+            if len(group_exps) > 0:
+                tasks.append(self.writer.write_async(group_exps))
+        if tasks:
+            await asyncio.gather(*tasks)
+        try:
+            metrics = gather_metrics(metric_list, "group_advantages")
+        except ValueError:
+            metrics = {}  # empty metric list causes ValueError, ignore it
+        return cnt, metrics
+@ADD_STRATEGY.register_module("grpo")
+class GRPOAddStrategy(GroupAdvantageStrategy):
+    """An example AddStrategy that calculates GRPO advantages."""
+    def __init__(self, writer: BufferWriter, epsilon: float = 1e-6, **kwargs) -> None:
+        super().__init__(writer)
+        self.epsilon = epsilon
+    def group_experiences(self, exps):
+        return group_by(exps, id_type="task")
+    def calculate_group_advantage(
+        self, group_id: str, exps: List[Experience]
+    ) -> Tuple[List[Experience], Dict]:
+        with torch.no_grad():
+            if len(exps) == 1:
+                group_reward_mean = torch.tensor(0.0)
+                group_reward_std = torch.tensor(1.0)
+            else:
+                rewards = torch.tensor([exp.reward for exp in exps], dtype=torch.float32)
+                group_reward_mean = torch.mean(rewards)
+                group_reward_std = torch.std(rewards)
+            for exp in exps:
+                score = (exp.reward - group_reward_mean) / (group_reward_std + self.epsilon)
+                exp.advantages = score * exp.action_mask
+                exp.returns = exp.advantages.clone()
+            metrics = {
+                "reward_mean": group_reward_mean.item(),
+                "reward_std": group_reward_std.item(),
+            }
+        return exps, metrics
+    @classmethod
+    def default_args(cls) -> dict:
+        return {"epsilon": 1e-6}
+@ADD_STRATEGY.register_module("opmd")
+class OPMDAddStrategy(GroupAdvantageStrategy):
+    """An example AddStrategy that calculates OPMD advantages."""
+    def __init__(
+        self, writer: BufferWriter, opmd_baseline: str = "mean", tau: float = 1.0, **kwargs
+    ) -> None:
+        super().__init__(writer)
+        assert opmd_baseline in [
+            "mean",
+            "logavgexp",
+        ], f"opmd_baseline must be 'mean' or 'logavgexp', got {opmd_baseline}"
+        self.opmd_baseline = opmd_baseline
+        self.tau = tau
+    def group_experiences(self, exps):
+        return group_by(exps, id_type="task")
+    def calculate_group_advantage(
+        self, group_id: str, exps: List[Experience]
+    ) -> Tuple[List[Experience], Dict]:
+        with torch.no_grad():
+            if len(exps) == 1:
+                group_baseline = torch.tensor(0.0)
+            else:
+                group_rewards = torch.tensor([exp.reward for exp in exps], dtype=torch.float32)
+                if self.opmd_baseline == "mean":
+                    group_baseline = torch.mean(group_rewards)
+                else:
+                    group_baseline = self.tau * (
+                        torch.logsumexp(group_rewards / self.tau, dim=-1)
+                        - torch.log(torch.tensor(len(exps)))
+                    )
+            for exp in exps:
+                score = exp.reward - group_baseline
+                exp.advantages = score * exp.action_mask
+                exp.returns = exp.advantages.clone()
+            metrics = {
+                "group_baseline": group_baseline,
+            }
+        return exps, metrics
+    @classmethod
+    def default_args(cls) -> dict:
+        return {"opmd_baseline": "mean", "tau": 1.0}
+@ADD_STRATEGY.register_module("reward_variance")
+class RewardVarianceAddStrategy(AddStrategy):
+    """An example AddStrategy that filters experiences based on a reward variance threshold."""
+    def __init__(self, writer: BufferWriter, variance_threshold: float = 0.0, **kwargs) -> None:
+        super().__init__(writer)
+        self.variance_threshold = variance_threshold
+    async def add(self, experiences: List[Experience], step: int) -> Tuple[int, Dict]:
+        cnt = 0
+        metrics = {}
+        tasks = []
+        with Timer(metrics, "add_strategy_time"):
+            grouped_experiences = group_by(experiences, id_type="task")
+            for _, group_exps in grouped_experiences.items():
+                if len(group_exps) < 2:
+                    continue
+                rewards = [exp.reward for exp in group_exps]
+                variance = np.var(rewards)
+                if variance <= self.variance_threshold:
+                    continue
+                cnt += len(group_exps)
+                tasks.append(self.writer.write_async(group_exps))
+            if tasks:
+                await asyncio.gather(*tasks)
+        return cnt, metrics
+    @classmethod
+    def default_args(cls) -> dict:
+        return {"variance_threshold": 0.0}
+def group_by(
+    experiences: List[Experience], id_type: Literal["task", "run", "step"]
+) -> Dict[str, List[Experience]]:
+    """Group experiences by ID."""
+    if id_type == "task":
+        id_type = "tid"
+    elif id_type == "run":
+        id_type = "rid"
+    elif id_type == "step":
+        id_type = "sid"
+    else:
+        raise ValueError(f"Unknown id_type: {id_type}")
+    grouped = {}
+    for exp in experiences:
+        group_id = getattr(exp.eid, id_type)
+        if group_id not in grouped:
+            grouped[group_id] = []
+        grouped[group_id].append(exp)
+    return grouped
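
The arithmetic in the three strategies added above can be checked without `torch` or a buffer. The sketch below reproduces the per-group formulas in plain Python under some assumptions: rewards are plain floats already grouped by task ID, and the function names are hypothetical. `statistics.stdev` is the sample standard deviation (matching the default of `torch.std`), while `statistics.pvariance` is the population variance (matching the default of `np.var`).

```python
import math
from statistics import mean, pvariance, stdev


def grpo_scores(rewards, epsilon=1e-6):
    """GRPO-style score: (r - group mean) / (group std + epsilon)."""
    if len(rewards) == 1:
        group_mean, group_std = 0.0, 1.0  # same single-sample fallback as the diff
    else:
        group_mean, group_std = mean(rewards), stdev(rewards)
    return [(r - group_mean) / (group_std + epsilon) for r in rewards]


def logavgexp_baseline(rewards, tau=1.0):
    """OPMD 'logavgexp' baseline: tau * (logsumexp(r / tau) - log(n))."""
    scaled = [r / tau for r in rewards]
    peak = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return tau * (log_sum_exp - math.log(len(rewards)))


def keep_group(rewards, variance_threshold=0.0):
    """Reward-variance filter: drop groups with fewer than 2 samples or low variance."""
    return len(rewards) >= 2 and pvariance(rewards) > variance_threshold


if __name__ == "__main__":
    group = [1.0, 0.0, 1.0, 0.0]
    print(grpo_scores(group))
    print(logavgexp_baseline(group, tau=1.0))
    print(keep_group(group), keep_group([1.0, 1.0]))
```

A group whose rewards are all identical has zero variance, so every GRPO score in it is zero and the `reward_variance` strategy skips it entirely.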

trinity_rft-0.2.1.dev0/trinity/algorithm/add_strategy/correct_bias_add_strategy.py ADDED

@@ -0,0 +1,54 @@
+# -*- coding: utf-8 -*-
+from typing import Dict, List, Tuple
+import torch
+from trinity.algorithm.add_strategy.add_strategy import ADD_STRATEGY, GRPOAddStrategy
+from trinity.buffer import BufferWriter
+from trinity.common.experience import Experience
+@ADD_STRATEGY.register_module("correct_bias")
+class CorrectBiasAddStrategy(GRPOAddStrategy):
+    """An Addstrategy with GroupAdvantage that corrects for rank bias (https://arxiv.org/pdf/2506.02355)"""
+    def __init__(
+        self, writer: BufferWriter, epsilon: float = 1e-6, rank_penalty: float = 0.25, **kwargs
+    ) -> None:
+        super().__init__(writer, epsilon, **kwargs)
+        self.rank_penalty = rank_penalty
+    def calculate_group_advantage(
+        self, group_id: str, exps: List[Experience]
+    ) -> Tuple[List[Experience], Dict]:
+        with torch.no_grad():
+            rewards = torch.tensor([exp.reward for exp in exps], dtype=torch.float32)
+            if len(exps) == 1:
+                group_reward_mean = torch.tensor(0.0)
+                group_reward_std = torch.tensor(1.0)
+            else:
+                # correct bias
+                old_log_probs = torch.tensor([torch.mean(exp.logprobs, axis=-1) for exp in exps])
+                group_ranks = torch.argsort(torch.argsort(old_log_probs))
+                group_ranks = group_ranks / len(group_ranks)
+                rewards = rewards * (1 - group_ranks * self.rank_penalty)
+                group_reward_mean = torch.mean(rewards)
+                group_reward_std = torch.std(rewards)
+            for i, exp in enumerate(exps):
+                score = (rewards[i] - group_reward_mean) / (group_reward_std + self.epsilon)
+                exp.advantages = score * exp.action_mask
+                exp.returns = exp.advantages.clone()
+            metrics = {
+                "reward_mean": group_reward_mean.item(),
+                "reward_std": group_reward_std.item(),
+            }
+        return exps, metrics
+    @classmethod
+    def default_args(cls) -> dict:
+        return {"epsilon": 1e-6, "rank_penalty": 0.25}
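
The rank correction above scales each reward by `1 - (rank / n) * rank_penalty`, where `rank` is the position of the sample's mean log-probability within its group (0 for the lowest). The following plain-Python sketch mirrors that step under the assumption that each sample is summarized by one float reward and one float mean log-probability. The function names are hypothetical, and ties are broken by input order here, which may differ from `torch.argsort`.

```python
from statistics import mean, stdev


def rank_adjusted_rewards(rewards, mean_logprobs, rank_penalty=0.25):
    """Scale rewards down for samples with higher mean log-probability."""
    n = len(rewards)
    # Equivalent of argsort(argsort(x)): ranks[i] is the rank of sample i.
    order = sorted(range(n), key=lambda i: mean_logprobs[i])
    ranks = [0] * n
    for rank, index in enumerate(order):
        ranks[index] = rank
    return [r * (1 - (ranks[i] / n) * rank_penalty) for i, r in enumerate(rewards)]


def rank_corrected_scores(rewards, mean_logprobs, rank_penalty=0.25, epsilon=1e-6):
    """Normalize the adjusted rewards as in the GRPO strategy."""
    if len(rewards) == 1:
        return [rewards[0] / (1.0 + epsilon)]  # mean 0, std 1 fallback
    adjusted = rank_adjusted_rewards(rewards, mean_logprobs, rank_penalty)
    mu, sigma = mean(adjusted), stdev(adjusted)
    return [(a - mu) / (sigma + epsilon) for a in adjusted]


if __name__ == "__main__":
    rewards = [1.0, 1.0]
    mean_logprobs = [-2.0, -1.0]
    print(rank_adjusted_rewards(rewards, mean_logprobs))
    print(rank_corrected_scores(rewards, mean_logprobs))
```

With equal raw rewards, the sample the policy already found more likely ends up with the lower score, so the less likely sample receives the positive advantage.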
