Stackformer 0.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 GURUMURTHY
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,75 @@
1
+ Metadata-Version: 2.4
2
+ Name: Stackformer
3
+ Version: 0.1.0
4
+ Summary: Modular transformer blocks built in PyTorch
5
+ Home-page: https://github.com/Gurumurthy30/Stackformer
6
+ Author: Gurumurthy
7
+ Author-email: Gurumurthy <gurumurthy.00300@gmail.com>
8
+ License: MIT
9
+ Project-URL: Repository, https://github.com/Gurumurthy30/Stackformer
10
+ Project-URL: Issue Tracker, https://github.com/Gurumurthy30/Stackformer/issues
11
+ Project-URL: Discussions, https://github.com/Gurumurthy30/Stackformer/discussions
12
+ Requires-Python: >=3.9
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+ Requires-Dist: torch>=2.6
16
+ Requires-Dist: tqdm>=4.67
17
+ Dynamic: author
18
+ Dynamic: home-page
19
+ Dynamic: license-file
20
+ Dynamic: requires-python
21
+
22
+ ## 🧱 Stackformer
23
+
24
+ **Stackformer** is a modular transformer-building framework written entirely in PyTorch. It is designed primarily for experimentation, providing various transformer blocks such as attention mechanisms, normalization layers, feed-forward networks, and a simple model architecture. The project is a work-in-progress with plans for further enhancements and expansions.
25
+
26
+ ---
27
+
28
+ ## 📖 About Me
29
+
30
+ My name is **Gurumurthy**, and I am a final-year Bachelor of Engineering student from India. I created this library as my own side project to showcase my skills and knowledge in deep learning and transformer architectures.
31
+
32
+ I am also interested and free to work with others on different projects for knowledge sharing and building connections.
33
+
34
+ ---
35
+
36
+ ## 🌟 Features
37
+
38
+ - Multiple attention mechanisms including multi-head, group query, linear, local, and KV cache variants
39
+ - Token embedding via `tiktoken`
40
+ - Absolute and sinusoidal positional embeddings
41
+ - Normalization layers like LayerNorm and RMSNorm
42
+ - Several feed-forward network variants with activations such as ReLU, GELU, SiLU, LeakyReLU, and Sigmoid
43
+ - A simple GPT-style transformer model implementation
44
+
45
+ ---
46
+
47
+ ## 📁 Project Structure
48
+
49
+ stackformer/ \
50
+ |-- modules/ \
51
+ | |-- tokenizer.py # Token embedding using tiktoken \
52
+ | |-- position_embedding.py # Absolute and sinusoidal embeddings \
53
+ | |-- Attention.py # Attention mechanisms \
54
+ | |-- Normalization.py # LayerNorm and RMSNorm \
55
+ | |-- Feed_forward.py # Feed-forward layers with various activations \
56
+ |-- models/ \
57
+ | -- GPT_2.py # GPT-style transformer stack model \
58
+ -- trainer.py # Training loop and utilities \
59
+
60
+ ---
61
+
62
+ ## 💻 Installation
63
+
64
+ Clone the repository and install in development mode:
65
+
66
+ ```bash
67
+ git clone https://github.com/Gurumurthy30/Stackformer
68
+ cd Stackformer
69
+ pip install -e .
70
+ ```
71
+
72
+ ---
73
+
74
+ ## 🚀 Future Plans
75
+ Currently, I am working on improving and optimizing the existing components while fixing known bugs and issues. After stabilizing the current modules, I plan to add more advanced blocks like Mixture of Experts (MoE), mask handling, and other essential transformer components. Eventually, I will expand the library by developing more comprehensive model architectures.
@@ -0,0 +1,54 @@
1
+ ## 🧱 Stackformer
2
+
3
+ **Stackformer** is a modular transformer-building framework written entirely in PyTorch. It is designed primarily for experimentation, providing various transformer blocks such as attention mechanisms, normalization layers, feed-forward networks, and a simple model architecture. The project is a work-in-progress with plans for further enhancements and expansions.
4
+
5
+ ---
6
+
7
+ ## 📖 About Me
8
+
9
+ My name is **Gurumurthy**, and I am a final-year Bachelor of Engineering student from India. I created this library as my own side project to showcase my skills and knowledge in deep learning and transformer architectures.
10
+
11
+ I am also interested and free to work with others on different projects for knowledge sharing and building connections.
12
+
13
+ ---
14
+
15
+ ## 🌟 Features
16
+
17
+ - Multiple attention mechanisms including multi-head, group query, linear, local, and KV cache variants
18
+ - Token embedding via `tiktoken`
19
+ - Absolute and sinusoidal positional embeddings
20
+ - Normalization layers like LayerNorm and RMSNorm
21
+ - Several feed-forward network variants with activations such as ReLU, GELU, SiLU, LeakyReLU, and Sigmoid
22
+ - A simple GPT-style transformer model implementation
23
+
24
+ ---
25
+
26
+ ## 📁 Project Structure
27
+
28
+ stackformer/ \
29
+ |-- modules/ \
30
+ | |-- tokenizer.py # Token embedding using tiktoken \
31
+ | |-- position_embedding.py # Absolute and sinusoidal embeddings \
32
+ | |-- Attention.py # Attention mechanisms \
33
+ | |-- Normalization.py # LayerNorm and RMSNorm \
34
+ | |-- Feed_forward.py # Feed-forward layers with various activations \
35
+ |-- models/ \
36
+ | -- GPT_2.py # GPT-style transformer stack model \
37
+ -- trainer.py # Training loop and utilities \
38
+
39
+ ---
40
+
41
+ ## 💻 Installation
42
+
43
+ Clone the repository and install in development mode:
44
+
45
+ ```bash
46
+ git clone https://github.com/Gurumurthy30/Stackformer
47
+ cd Stackformer
48
+ pip install -e .
49
+ ```
50
+
51
+ ---
52
+
53
+ ## 🚀 Future Plans
54
+ Currently, I am working on improving and optimizing the existing components while fixing known bugs and issues. After stabilizing the current modules, I plan to add more advanced blocks like Mixture of Experts (MoE), mask handling, and other essential transformer components. Eventually, I will expand the library by developing more comprehensive model architectures.
@@ -0,0 +1,75 @@
1
+ Metadata-Version: 2.4
2
+ Name: Stackformer
3
+ Version: 0.1.0
4
+ Summary: Modular transformer blocks built in PyTorch
5
+ Home-page: https://github.com/Gurumurthy30/Stackformer
6
+ Author: Gurumurthy
7
+ Author-email: Gurumurthy <gurumurthy.00300@gmail.com>
8
+ License: MIT
9
+ Project-URL: Repository, https://github.com/Gurumurthy30/Stackformer
10
+ Project-URL: Issue Tracker, https://github.com/Gurumurthy30/Stackformer/issues
11
+ Project-URL: Discussions, https://github.com/Gurumurthy30/Stackformer/discussions
12
+ Requires-Python: >=3.9
13
+ Description-Content-Type: text/markdown
14
+ License-File: LICENSE
15
+ Requires-Dist: torch>=2.6
16
+ Requires-Dist: tqdm>=4.67
17
+ Dynamic: author
18
+ Dynamic: home-page
19
+ Dynamic: license-file
20
+ Dynamic: requires-python
21
+
22
+ ## 🧱 Stackformer
23
+
24
+ **Stackformer** is a modular transformer-building framework written entirely in PyTorch. It is designed primarily for experimentation, providing various transformer blocks such as attention mechanisms, normalization layers, feed-forward networks, and a simple model architecture. The project is a work-in-progress with plans for further enhancements and expansions.
25
+
26
+ ---
27
+
28
+ ## 📖 About Me
29
+
30
+ My name is **Gurumurthy**, and I am a final-year Bachelor of Engineering student from India. I created this library as my own side project to showcase my skills and knowledge in deep learning and transformer architectures.
31
+
32
+ I am also interested and free to work with others on different projects for knowledge sharing and building connections.
33
+
34
+ ---
35
+
36
+ ## 🌟 Features
37
+
38
+ - Multiple attention mechanisms including multi-head, group query, linear, local, and KV cache variants
39
+ - Token embedding via `tiktoken`
40
+ - Absolute and sinusoidal positional embeddings
41
+ - Normalization layers like LayerNorm and RMSNorm
42
+ - Several feed-forward network variants with activations such as ReLU, GELU, SiLU, LeakyReLU, and Sigmoid
43
+ - A simple GPT-style transformer model implementation
44
+
45
+ ---
46
+
47
+ ## 📁 Project Structure
48
+
49
+ stackformer/ \
50
+ |-- modules/ \
51
+ | |-- tokenizer.py # Token embedding using tiktoken \
52
+ | |-- position_embedding.py # Absolute and sinusoidal embeddings \
53
+ | |-- Attention.py # Attention mechanisms \
54
+ | |-- Normalization.py # LayerNorm and RMSNorm \
55
+ | |-- Feed_forward.py # Feed-forward layers with various activations \
56
+ |-- models/ \
57
+ | -- GPT_2.py # GPT-style transformer stack model \
58
+ -- trainer.py # Training loop and utilities \
59
+
60
+ ---
61
+
62
+ ## 💻 Installation
63
+
64
+ Clone the repository and install in development mode:
65
+
66
+ ```bash
67
+ git clone https://github.com/Gurumurthy30/Stackformer
68
+ cd Stackformer
69
+ pip install -e .
70
+ ```
71
+
72
+ ---
73
+
74
+ ## 🚀 Future Plans
75
+ Currently, I am working on improving and optimizing the existing components while fixing known bugs and issues. After stabilizing the current modules, I plan to add more advanced blocks like Mixture of Experts (MoE), mask handling, and other essential transformer components. Eventually, I will expand the library by developing more comprehensive model architectures.
@@ -0,0 +1,18 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ setup.py
5
+ Stackformer.egg-info/PKG-INFO
6
+ Stackformer.egg-info/SOURCES.txt
7
+ Stackformer.egg-info/dependency_links.txt
8
+ Stackformer.egg-info/requires.txt
9
+ Stackformer.egg-info/top_level.txt
10
+ models/GPT_2.py
11
+ models/__init__.py
12
+ modules/Attention.py
13
+ modules/Feed_forward.py
14
+ modules/Normalization.py
15
+ modules/__init__.py
16
+ modules/mask.py
17
+ modules/position_embedding.py
18
+ modules/tokenizer.py
@@ -0,0 +1,2 @@
1
+ torch>=2.6
2
+ tqdm>=4.67
@@ -0,0 +1,2 @@
1
+ models
2
+ modules
@@ -0,0 +1,238 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ import torch.nn.functional as F
5
+
6
+ # --- position embedding ---
7
class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed (non-learned) sinusoidal position encodings.

    A (seq_len, emb_dim) table is precomputed once and registered as the
    buffer ``pe``; ``forward`` slices and broadcasts it to the batch.
    """

    def __init__(self, seq_len, emb_dim):
        super().__init__()
        self.seq_len = seq_len
        self.emb_dim = emb_dim

        # Inverse frequencies 10000^(-2i/d), computed in log space.
        even_steps = torch.arange(0, emb_dim, 2)
        inv_freq = torch.exp(even_steps * -(math.log(10000.0) / emb_dim))
        angles = torch.arange(0, seq_len).unsqueeze(1) * inv_freq

        table = torch.zeros(seq_len, emb_dim)
        table[:, 0::2] = torch.sin(angles)  # even dims: sine
        table[:, 1::2] = torch.cos(angles)  # odd dims: cosine
        self.register_buffer("pe", table)

    def forward(self, x):
        """Return (batch, seq, emb_dim) codes; x may be (batch, seq) ids or (batch, seq, emb_dim)."""
        n_batch, n_tok = x.shape[0], x.shape[1]
        codes = self.pe[:n_tok].unsqueeze(0)
        return codes.expand(n_batch, n_tok, -1).to(x.device)
26
+
27
+ # --- Multi Head Attention ---
28
class MultiHeadAttention(nn.Module):
    """Causal multi-head self-attention with separate Q/K/V projections."""

    def __init__(self, Emb_dim, num_heads, dropout=0.1, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.device = device
        self.num_heads = num_heads
        self.head_dim = Emb_dim // num_heads

        # Bias-free projections for queries, keys and values.
        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.scale = math.sqrt(self.head_dim)  # scaled dot-product denominator
        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        n_batch, n_tok, _ = x.shape
        split = (n_batch, n_tok, self.num_heads, self.head_dim)

        # Project and reshape to (batch, heads, seq, head_dim).
        k = self.key(x).view(*split).transpose(1, 2)
        q = self.query(x).view(*split).transpose(1, 2)
        v = self.value(x).view(*split).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) / self.scale

        # Causal mask built per-call so any sequence length works.
        future = torch.triu(torch.ones(n_tok, n_tok, dtype=torch.bool, device=x.device), diagonal=1)
        logits = logits.masked_fill(future[None, None, :, :], float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))

        # Weighted sum over values, then merge heads and project out.
        mixed = (weights @ v).transpose(1, 2).contiguous().view(n_batch, n_tok, self.Emb_dim)
        return self.out_proj(mixed)
72
+
73
+ # --- Feed Forward ---
74
class FF_ReLU(nn.Module):
    """Position-wise feed-forward network: Linear -> ReLU -> Dropout -> Linear."""

    def __init__(self, Emb_dim, hidden_dim, dropout=0.1, device='cpu', dtype=torch.float32):
        super().__init__()
        stages = [
            nn.Linear(Emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, Emb_dim, device=device, dtype=dtype),
        ]
        # Attribute name kept for state_dict compatibility.
        self.relu = nn.Sequential(*stages)

    def forward(self, x):
        """Apply the MLP along the last dimension of x."""
        return self.relu(x)
86
+
87
class LayerNorm(nn.Module):
    """Layer normalization over the last dimension with learnable affine terms."""

    def __init__(self, Emb_dim, eps=1e-5, device='cpu', dtype=torch.float32):
        super().__init__()
        self.eps = eps  # guards against division by zero for tiny variance
        self.weight = nn.Parameter(torch.ones(Emb_dim, device=device, dtype=dtype))
        self.bias = nn.Parameter(torch.zeros(Emb_dim, device=device, dtype=dtype))

    def forward(self, x):
        """Normalize x feature-wise, then scale and shift."""
        mu = x.mean(dim=-1, keepdim=True)
        # Population variance (unbiased=False), matching nn.LayerNorm.
        sigma2 = x.var(dim=-1, keepdim=True, unbiased=False)
        standardized = (x - mu) / torch.sqrt(sigma2 + self.eps)
        return standardized * self.weight + self.bias
99
+
100
+ # --- Transformer Block ---
101
class Block(nn.Module):
    """Pre-norm transformer block: x + Attn(LN(x)), then x + FFN(LN(x))."""

    def __init__(self, Emb_dim, num_heads, dropout, hidden_dim, eps=1e-5, device='cpu', dtype=torch.float32):
        super().__init__()
        self.attention = MultiHeadAttention(Emb_dim, num_heads, dropout, device=device, dtype=dtype)
        self.norm1 = LayerNorm(Emb_dim, eps=eps, device=device, dtype=dtype)
        self.ff_relu = FF_ReLU(Emb_dim, hidden_dim, dropout, device=device, dtype=dtype)
        self.norm2 = LayerNorm(Emb_dim, eps=eps, device=device, dtype=dtype)

    def forward(self, x):
        """Apply attention and feed-forward sublayers, each with a residual skip."""
        # Normalize *before* each sublayer (pre-norm), add the skip after.
        x = x + self.attention(self.norm1(x))
        x = x + self.ff_relu(self.norm2(x))
        return x
123
+
124
+ # --- Encoder ---
125
class Encoder(nn.Module):
    """A sequential stack of `num_layers` identical transformer blocks."""

    def __init__(self, num_layers, Emb_dim, num_heads, dropout, hidden_dim, eps=1e-5, device='cpu', dtype=torch.float32):
        super().__init__()
        blocks = [
            Block(Emb_dim, num_heads, dropout, hidden_dim, eps, device=device, dtype=dtype)
            for _ in range(num_layers)
        ]
        self.layers = nn.ModuleList(blocks)

    def forward(self, x):
        """Run x through each block in order."""
        for block in self.layers:
            x = block(x)
        return x
137
+
138
class GPTModel(nn.Module):
    """GPT-style language model.

    Pipeline: token embedding + sinusoidal position embedding -> dropout ->
    stack of pre-norm transformer blocks -> final LayerNorm -> linear head
    projecting to vocabulary logits.
    """

    def __init__(self, vocab_size, num_layers, Emb_dim, num_heads, seq_len,
                 dropout, hidden_dim, eps=1e-5, device='cpu', dtype=torch.float32):

        super().__init__()
        # Bug fix: self.device / self.dtype / self.seq_len were read below and
        # in generate() but never assigned, so instantiation raised
        # AttributeError.
        self.device = device
        self.dtype = dtype
        self.seq_len = seq_len

        # --- Token embedding ---
        self.embedding = nn.Embedding(vocab_size, Emb_dim, dtype=self.dtype, device=self.device)

        # --- Embedding dropout ---
        self.emb_dropout = nn.Dropout(dropout)

        # --- Fixed sinusoidal position embedding (not learned) ---
        self.position_embedding = SinusoidalPositionalEmbedding(
            emb_dim=Emb_dim,
            seq_len=seq_len
        )

        # --- Encoder (decoder-style blocks with causal attention) ---
        self.encoder = Encoder(
            num_layers=num_layers,
            Emb_dim=Emb_dim,
            num_heads=num_heads,
            dropout=dropout,
            hidden_dim=hidden_dim,
            eps=eps,
            device=self.device,
            dtype=self.dtype
        )

        # --- Final norm ---
        self.final_norm = LayerNorm(Emb_dim, eps=eps,
                                    device=self.device, dtype=self.dtype)

        # --- Output projection to vocabulary logits ---
        self.lm_head = nn.Linear(Emb_dim, vocab_size, bias=False,
                                 dtype=self.dtype, device=self.device)

    def forward(self, x):
        """Map token ids (batch, seq_len) to logits (batch, seq_len, vocab_size)."""
        emb = self.embedding(x)               # (batch, seq_len, emb_dim)
        pos = self.position_embedding(x)      # (batch, seq_len, emb_dim)
        x = self.emb_dropout(emb + pos)
        x = self.encoder(x)
        x = self.final_norm(x)
        return self.lm_head(x)

    @torch.no_grad()
    def generate(self, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=None, top_p=1.0, eos_token_id=None):
        """Autoregressively sample up to `max_new_tokens` tokens after `prompt_ids`.

        prompt_ids: (seq_len,) or (batch, seq_len) token ids.
        temperature / top_k / top_p control the sampling distribution.
        NOTE: the eos check calls next_token.item(), so early stopping only
        works for a single sequence (batch size 1).
        """
        self.eval()
        if prompt_ids.dim() == 1:
            prompt_ids = prompt_ids.unsqueeze(0)  # (1, seq_len)

        generated = prompt_ids.clone()
        max_context_len = self.seq_len

        for _ in range(max_new_tokens):
            # Sliding window: the model never sees more than seq_len tokens.
            if generated.size(1) > max_context_len:
                input_ids = generated[:, -max_context_len:]
            else:
                input_ids = generated

            logits = self.forward(input_ids)  # (batch, seq_len, vocab_size)
            logits = logits[:, -1, :]         # last position only

            # --- Temperature scaling ---
            if temperature != 1.0:
                logits = logits / temperature

            # --- Top-k filtering: keep only the k highest logits ---
            if top_k is not None and top_k > 0:
                topk_vals, topk_indices = torch.topk(logits, top_k)
                mask = torch.full_like(logits, float('-inf'))
                mask.scatter_(dim=-1, index=topk_indices, src=topk_vals)
                logits = mask

            # --- Top-p (nucleus): drop tokens past cumulative prob top_p ---
            if top_p < 1.0:
                sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
                probs = F.softmax(sorted_logits, dim=-1)
                cum_probs = torch.cumsum(probs, dim=-1)

                sorted_mask = cum_probs > top_p
                # Shift right so the first token over the threshold survives.
                sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
                sorted_mask[..., 0] = 0

                indices_to_remove = sorted_mask.scatter(dim=-1, index=sorted_indices, src=sorted_mask)
                logits = logits.masked_fill(indices_to_remove, float('-inf'))

            # Sample the next token from the filtered distribution.
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)  # (batch, 1)
            generated = torch.cat([generated, next_token], dim=-1)

            # Early stop on end-of-sequence (batch size 1 only; see docstring).
            if eos_token_id is not None and next_token.item() == eos_token_id:
                break

        return generated
File without changes
@@ -0,0 +1,533 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
class Self_Attention(nn.Module):
    """Single-head causal self-attention with bias-free Q/K/V projections."""

    def __init__(self, Emb_dim, dropout, dtype=torch.float32, device='cpu'):
        super().__init__()
        self.device = device
        self.scale = torch.tensor(Emb_dim ** 0.5, dtype=dtype, device=device)

        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, D) -> (batch, seq, D), masking future positions."""
        _, n_tok, _ = x.size()

        q = self.query(x)
        k = self.key(x)
        v = self.value(x)

        # Scaled dot-product scores: (batch, seq, seq).
        logits = q @ k.transpose(-2, -1) / self.scale

        # Strictly-upper-triangular mask hides future tokens.
        future = torch.triu(torch.ones(n_tok, n_tok, dtype=torch.bool, device=self.device), diagonal=1)
        logits = logits.masked_fill_(future, float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))
        return self.out_proj(weights @ v)
35
+
36
class Multi_Head_Attention(nn.Module):
    """Causal multi-head self-attention (heads split from a shared projection)."""

    def __init__(self, Emb_dim, num_heads, dropout, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.num_heads = num_heads
        self.device = device
        self.head_dim = Emb_dim // num_heads

        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.scale = torch.tensor(self.head_dim ** 0.5, device=device, dtype=dtype)
        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        n_batch, n_tok, _ = x.shape
        split = (n_batch, n_tok, self.num_heads, self.head_dim)

        # (batch, heads, seq, head_dim) views of K, Q, V.
        k = self.key(x).view(*split).transpose(1, 2)
        q = self.query(x).view(*split).transpose(1, 2)
        v = self.value(x).view(*split).transpose(1, 2)

        # (batch, heads, seq, seq) scaled dot-product scores.
        logits = (q @ k.transpose(-2, -1)) / self.scale

        # Causal mask: each token may attend only to itself and the past.
        future = torch.triu(torch.ones(n_tok, n_tok, dtype=torch.bool, device=self.device), diagonal=1)
        logits = logits.masked_fill_(future[None, None, :, :], float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))

        # Weighted value mix, heads re-merged, then output projection.
        mixed = (weights @ v).transpose(1, 2).contiguous().view(n_batch, n_tok, self.Emb_dim)
        return self.out_proj(mixed)
80
+
81
class Cross_MultiHead_Attention(nn.Module):
    """Multi-head attention where keys/values may come from a separate context.

    With ``context=None`` it behaves as causal self-attention; with a context
    tensor it performs encoder-decoder style cross-attention.
    """

    def __init__(self, Emb_dim, num_heads, dropout, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.device = device
        self.num_heads = num_heads
        self.head_dim = Emb_dim // num_heads

        # Query, Key, Value projections (bias-free).
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.scale = torch.tensor(self.head_dim ** 0.5, device=device, dtype=dtype)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, context=None):
        """
        x: (Batch_size, query_seq_len, Emb_dim) — query input (e.g., decoder hidden states)
        context: (Batch_size, KV_seq_len, Emb_dim) — source for keys/values (e.g., encoder output). If None, self-attention.
        Returns: (Batch_size, query_seq_len, Emb_dim)
        """
        Batch_size, query_seq_len, _ = x.shape
        context = x if context is None else context  # self-attention fallback
        KV_seq_len = context.shape[1]

        # Project Q from x; K, V from the context.
        Querys = self.query(x).view(Batch_size, query_seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        Keys = self.key(context).view(Batch_size, KV_seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        Values = self.value(context).view(Batch_size, KV_seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # Attention scores: (Batch_size, nh, query_seq_len, KV_seq_len)
        scores = (Querys @ Keys.transpose(-2, -1)) / self.scale

        # Bug fix: the original always built a square (q, q) causal mask and
        # applied it to (q, kv) scores, crashing whenever the context length
        # differed from the query length. Causal masking only makes sense when
        # queries and keys index the same sequence, so apply it only then;
        # cross-attention over a different-length context attends freely.
        if KV_seq_len == query_seq_len:
            causal_mask = torch.triu(
                torch.ones(query_seq_len, KV_seq_len, dtype=torch.bool, device=self.device), diagonal=1
            )
            scores = scores.masked_fill(causal_mask[None, None, :, :], float('-inf'))

        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)

        out = attn @ Values  # (Batch_size, nh, query_seq_len, hd)
        out = out.transpose(1, 2).contiguous().view(Batch_size, query_seq_len, self.Emb_dim)

        return self.out_proj(out)
128
+
129
class Multi_query_Attention(nn.Module):
    """Multi-query attention: per-head queries share one key/value head."""

    def __init__(self, Emb_dim, num_heads, dropout, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.device = device
        self.num_heads = num_heads
        self.head_dim = Emb_dim // num_heads

        # Full-width query projection; K and V project to a single head.
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(Emb_dim, self.head_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, self.head_dim, bias=False, dtype=dtype, device=device)

        self.scale = torch.tensor(self.head_dim ** 0.5, device=device, dtype=dtype)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        n_batch, n_tok, _ = x.shape

        q = self.query(x).view(n_batch, n_tok, self.num_heads, self.head_dim).transpose(1, 2)
        # Single shared K/V head, broadcast across all query heads.
        k = self.key(x).unsqueeze(1).expand(n_batch, 1, n_tok, self.head_dim)
        v = self.value(x).unsqueeze(1).expand(n_batch, 1, n_tok, self.head_dim)

        # Head dim broadcasts 1 -> num_heads: (batch, nh, seq, seq).
        logits = (q @ k.transpose(-2, -1)) / self.scale

        future = torch.triu(torch.ones(n_tok, n_tok, dtype=torch.bool, device=self.device), diagonal=1)
        logits = logits.masked_fill_(future[None, None, :, :], float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))

        mixed = weights @ v  # (batch, nh, seq, hd)
        mixed = mixed.transpose(1, 2).contiguous().view(n_batch, n_tok, self.Emb_dim)
        return self.out_proj(mixed)
177
+
178
class Group_query_Attention(nn.Module):
    """Grouped-query attention: several query heads share each key/value head."""

    def __init__(self, Emb_dim, num_query_heads, num_kv_heads, dropout, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_query_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.device = device
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads

        self.head_dim = Emb_dim // num_query_heads
        # How many query heads map onto each KV head.
        self.num_queries_pre_kv = num_query_heads // num_kv_heads

        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(Emb_dim, self.num_kv_heads * self.head_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, self.num_kv_heads * self.head_dim, bias=False, dtype=dtype, device=device)

        self.scale = torch.tensor(self.head_dim ** 0.5, device=device, dtype=dtype)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        n_batch, n_tok, _ = x.shape

        # Queries at full head count; K/V at the reduced KV head count.
        q = self.query(x).view(n_batch, n_tok, self.num_query_heads, self.head_dim).transpose(1, 2)
        k = self.key(x).view(n_batch, n_tok, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.value(x).view(n_batch, n_tok, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Duplicate each KV head so every query head has a partner.
        k = k.repeat_interleave(self.num_queries_pre_kv, dim=1)
        v = v.repeat_interleave(self.num_queries_pre_kv, dim=1)

        logits = (q @ k.transpose(-2, -1)) / self.scale

        future = torch.triu(torch.ones(n_tok, n_tok, dtype=torch.bool, device=self.device), diagonal=1)
        logits = logits.masked_fill_(future[None, None, :, :], float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))

        mixed = weights @ v
        mixed = mixed.transpose(1, 2).contiguous().view(n_batch, n_tok, self.Emb_dim)
        return self.out_proj(mixed)
229
+
230
class Linear_Attention(nn.Module):
    """Causal linear attention using the elu(x)+1 feature map.

    Causality comes from prefix sums (cumsum over the time axis) rather than
    an explicit mask.
    """

    def __init__(self, Emb_dim, num_heads, dropout, eps=1e-5, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.eps = eps  # keeps the normalizer strictly positive
        self.device = device
        self.dtype = dtype
        self.num_heads = num_heads
        self.head_dim = Emb_dim // self.num_heads

        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        n_batch, n_tok, _ = x.shape
        split = (n_batch, n_tok, self.num_heads, self.head_dim)

        q = self.query(x).view(*split).transpose(1, 2)  # (batch, nh, seq, hd)
        k = self.key(x).view(*split).transpose(1, 2)
        v = self.value(x).view(*split).transpose(1, 2)

        # Positive feature map phi(u) = elu(u) + 1.
        feat_q = F.elu(q) + 1.0
        feat_k = F.elu(k) + 1.0

        # Outer products phi(k_t) v_t^T, accumulated causally over time.
        outer = torch.matmul(feat_k.unsqueeze(-1), v.unsqueeze(-2))
        state = torch.cumsum(outer, dim=2)
        normalizer = torch.cumsum(feat_k, dim=2)

        # Numerator phi(q_t)^T S_t and scalar denominator phi(q_t)^T z_t.
        num = torch.matmul(feat_q.unsqueeze(-2), state).squeeze(-2)
        den = torch.sum(feat_q * normalizer, dim=-1, keepdim=True) + self.eps

        mixed = num / den
        mixed = mixed.transpose(1, 2).contiguous().view(n_batch, n_tok, self.Emb_dim)
        return self.dropout(self.out_proj(mixed))
272
+
273
class Multi_latent_Attention(nn.Module):
    """Multi-head Latent Attention: low-rank latent compression of Q and KV.

    Queries and keys/values are first projected down to narrow latent spaces
    (W_dq / W_dkv), layer-normalized, then expanded back to Emb_dim before
    standard causal scaled-dot-product attention.

    Args:
        Emb_dim: model embedding width (must be divisible by num_heads).
        q_compressed_dim: latent width for the query path.
        kv_compressed_dim: shared latent width for keys and values.
        num_heads: number of attention heads.
        device / dtype: placement of all parameters.
        dropout: dropout probability (attention weights and output).
    """

    def __init__(self, Emb_dim, q_compressed_dim, kv_compressed_dim, num_heads, device='cpu', dtype=torch.float32, dropout=0):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.device = device
        self.q_compressed_dim = q_compressed_dim
        self.kv_compressed_dim = kv_compressed_dim
        self.num_heads = num_heads
        self.head_dim = Emb_dim // self.num_heads

        # Query path: down-project -> LayerNorm -> up-project.
        self.W_dq = nn.Linear(Emb_dim, q_compressed_dim, bias=False, dtype=dtype, device=device)
        self.W_dq_norm = nn.LayerNorm(q_compressed_dim, dtype=dtype, device=device)
        self.W_uq = nn.Linear(q_compressed_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        # Shared KV latent path.
        self.W_dkv = nn.Linear(Emb_dim, kv_compressed_dim, bias=False, dtype=dtype, device=device)
        self.W_dkv_norm = nn.LayerNorm(kv_compressed_dim, dtype=dtype, device=device)
        self.W_uk = nn.Linear(kv_compressed_dim, Emb_dim, dtype=dtype, device=device)
        self.W_uv = nn.Linear(kv_compressed_dim, Emb_dim, dtype=dtype, device=device)

        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim), causal attention."""
        Batch_size, Seq_len, C = x.shape

        q_final = self.W_uq(self.W_dq_norm(self.W_dq(x)))

        kv_latent = self.W_dkv_norm(self.W_dkv(x))
        k_final = self.W_uk(kv_latent)
        v_final = self.W_uv(kv_latent)

        Querys = q_final.view(Batch_size, Seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        Keys = k_final.view(Batch_size, Seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        Values = v_final.view(Batch_size, Seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        # BUG FIX: F.scaled_dot_product_attention knows nothing about module
        # train/eval mode, so passing self.dropout.p unconditionally applied
        # attention dropout at inference time too. Disable it when not training.
        out = F.scaled_dot_product_attention(
            query=Querys,
            key=Keys,
            value=Values,
            attn_mask=None,
            is_causal=True,
            dropout_p=self.dropout.p if self.training else 0.0,
        )
        out = self.out_proj(out.transpose(1, 2).contiguous().view(Batch_size, Seq_len, C))
        out = self.dropout(out)
        return out
324
+
325
class Local_Attention(nn.Module):
    """Sliding-window (local) causal multi-head attention.

    Each position attends only to itself and the previous
    ``Window_size - 1`` positions; everything outside that band is masked
    to -inf before the softmax.
    """

    def __init__(self, Emb_dim, num_heads, Window_size, dropout, device='cpu', dtype=torch.float32):
        super().__init__()
        assert Emb_dim % num_heads == 0, "Emb_dim must be divisible by num_heads"
        self.Emb_dim = Emb_dim
        self.Window_size = Window_size
        self.device = device
        self.num_heads = num_heads
        self.head_dim = Emb_dim // num_heads

        self.key = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.query = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(Emb_dim, Emb_dim, bias=False, dtype=dtype, device=device)

        self.scale = torch.tensor(self.head_dim ** 0.5, device=device, dtype=dtype)
        self.out_proj = nn.Linear(Emb_dim, Emb_dim, dtype=dtype, device=device)

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x: (batch, seq, Emb_dim) -> (batch, seq, Emb_dim)."""
        bsz, seq, _ = x.shape

        def split_heads(t):
            # (bsz, seq, emb) -> (bsz, heads, seq, head_dim)
            return t.view(bsz, seq, self.num_heads, self.head_dim).transpose(1, 2)

        k = split_heads(self.key(x))
        q = split_heads(self.query(x))
        v = split_heads(self.value(x))

        logits = (q @ k.transpose(-2, -1)) / self.scale  # (bsz, heads, seq, seq)

        # Banded causal mask: keep entries with 0 <= i - j < Window_size.
        lower = torch.tril(torch.ones_like(logits, dtype=bool))
        band = torch.triu(lower, diagonal=-(self.Window_size - 1))
        logits = logits.masked_fill_(~band, float('-inf'))

        weights = self.dropout(F.softmax(logits, dim=-1))
        ctx = weights @ v  # (bsz, heads, seq, head_dim)

        ctx = ctx.transpose(1, 2).contiguous().view(bsz, seq, self.Emb_dim)
        return self.out_proj(ctx)
371
+
372
def precompute_theta_position_frequency(head_dim, seq_len, device='cpu', theta=10000.0):
    """Precompute RoPE rotation factors as complex exponentials.

    Returns a (seq_len, head_dim // 2) complex tensor whose entry (m, i)
    equals exp(j * m / theta**(2i / head_dim)).
    """
    assert head_dim % 2 == 0, "head_dim must be even"

    # Per-pair inverse frequencies: 1 / theta**(2i / head_dim).
    even_idx = torch.arange(0, head_dim, 2, device=device)
    inv_freq = 1.0 / (theta ** (even_idx / head_dim))

    # Absolute positions 0 .. seq_len-1.
    positions = torch.arange(seq_len, device=device)

    # Angle table: outer(position, inverse frequency) -> (seq_len, head_dim // 2).
    angles = torch.outer(positions, inv_freq)

    # Unit-magnitude complex numbers carrying the rotations.
    return torch.polar(torch.ones_like(angles), angles)
388
+
389
+
390
def apply_rotry_position_embedding(x, freq_complex, device='cpu', dtype=torch.float32):
    """Rotate (batch, seq, heads, emb) features with precomputed RoPE factors.

    Adjacent feature pairs are interpreted as complex numbers and multiplied
    by ``freq_complex`` (truncated to the input's sequence length).
    """
    batch_size, seq_len, num_head, emb_dim = x.shape
    assert emb_dim % 2 == 0, "emb_dim must be even"

    # Pair up the last dimension and reinterpret pairs as complex numbers.
    as_pairs = x.view(batch_size, seq_len, num_head, emb_dim // 2, 2).to(device=device, dtype=dtype)
    as_complex = torch.view_as_complex(as_pairs)

    # Broadcast shape (1, seq_len, 1, emb_dim // 2) over batch and heads.
    rot = freq_complex[:seq_len].unsqueeze(0).unsqueeze(2).to(device=device)

    # Complex multiplication performs the 2D rotation of each pair.
    rotated = as_complex * rot

    # Back to interleaved real pairs in the original layout.
    flat = torch.view_as_real(rotated).contiguous().view(batch_size, seq_len, num_head, emb_dim)
    return flat.to(device=device, dtype=dtype)
408
+
409
class kv_cache_multihead(nn.Module):
    """Multi-head attention with a preallocated KV cache for incremental decoding.

    Call forward repeatedly with consecutive chunks; ``start_pos`` is the
    absolute position of the chunk's first token. Each chunk's keys/values are
    written into fixed-size caches so later chunks can attend to the prefix.
    """

    def __init__(self, emb_dim, num_heads, batch_size, kv_seq_len, device='cpu', dtype=torch.float32, dropout=0.1):
        super().__init__()
        self.dtype = dtype
        self.device = device

        assert emb_dim % num_heads == 0
        self.emb_dim = emb_dim
        self.num_heads = num_heads
        self.head_dim = emb_dim // num_heads
        self.kv_seq_len = kv_seq_len  # maximum cached sequence length

        self.query = nn.Linear(emb_dim, emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(emb_dim, emb_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(emb_dim, emb_dim, bias=False, dtype=dtype, device=device)

        self.out_proj = nn.Linear(emb_dim, emb_dim, dtype=dtype, device=device)
        self.dropout = nn.Dropout(dropout)

        # FIX: register caches as (non-persistent) buffers so .to()/device moves
        # track the module, matching kv_cache_group_query. persistent=False keeps
        # state_dict identical to the previous plain-tensor behaviour.
        self.register_buffer("cache_keys", torch.zeros(batch_size, kv_seq_len, num_heads, self.head_dim, dtype=dtype, device=device), persistent=False)
        self.register_buffer("cache_value", torch.zeros(batch_size, kv_seq_len, num_heads, self.head_dim, dtype=dtype, device=device), persistent=False)

    # FIX: the old signature was ``RoPE: False`` — a meaningless annotation and
    # no default; make it a proper keyword with a backward-compatible default.
    def forward(self, x, start_pos, RoPE: bool = False):
        """Attend over cached positions [0, start_pos + seq_len).

        Args:
            x: (batch, seq_len, emb_dim) chunk starting at ``start_pos``.
            start_pos: absolute index of x[:, 0] in the full sequence.
            RoPE: apply rotary embeddings to queries and keys.
        """
        batch_size, seq_len, C = x.shape

        xq = self.query(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        xk = self.key(x).view(batch_size, seq_len, self.num_heads, self.head_dim)
        xv = self.value(x).view(batch_size, seq_len, self.num_heads, self.head_dim)

        if RoPE:
            freq_complex = precompute_theta_position_frequency(head_dim=self.head_dim, seq_len=seq_len, device=self.device)
            xq = apply_rotry_position_embedding(xq, freq_complex, device=self.device, dtype=self.dtype)
            # NOTE(review): keys are rotated with positions 0..seq_len-1 even
            # when start_pos > 0 (apply_rotry_position_embedding always slices
            # frequencies from 0), so cached-key rotations are absolute only
            # for the first chunk — confirm intended behaviour upstream.
            freq_complex = precompute_theta_position_frequency(head_dim=self.head_dim, seq_len=self.kv_seq_len, device=self.device)
            xk = apply_rotry_position_embedding(xk, freq_complex, device=self.device, dtype=self.dtype)

        # Write this chunk into the cache, then read back the full prefix.
        self.cache_keys[:, start_pos:start_pos+seq_len] = xk
        self.cache_value[:, start_pos:start_pos+seq_len] = xv

        xk_full = self.cache_keys[:, :start_pos+seq_len]
        xv_full = self.cache_value[:, :start_pos+seq_len]

        query = xq.transpose(1, 2)      # (batch, num_heads, seq_len, head_dim)
        key = xk_full.transpose(1, 2)   # (batch, num_heads, total_len, head_dim)
        value = xv_full.transpose(1, 2)

        attn_scores = torch.matmul(query, key.transpose(2, 3)) / (self.head_dim ** 0.5)

        # BUG FIX: the mask must be (seq_len, start_pos + seq_len) and offset by
        # start_pos; the previous (seq_len, seq_len) triu mask failed to
        # broadcast against the scores on every call with start_pos > 0.
        # Query at absolute position p may see keys at positions <= p.
        total_len = start_pos + seq_len
        q_pos = torch.arange(start_pos, total_len, device=self.device).unsqueeze(1)
        k_pos = torch.arange(total_len, device=self.device).unsqueeze(0)
        attn_scores.masked_fill_((k_pos > q_pos)[None, None, :, :], float('-inf'))

        attn_weights = F.softmax(attn_scores, dim=-1)
        out = torch.matmul(attn_weights, value)

        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, -1)
        return self.dropout(self.out_proj(out))
466
+
467
class kv_cache_group_query(nn.Module):
    """Grouped-query attention with a preallocated KV cache for incremental decoding.

    Several query heads share each key/value head; cached keys/values are
    repeated (repeat_interleave) to match the query heads at attention time.
    ``start_pos`` is the absolute position of the current chunk's first token.
    """

    def __init__(self, emb_dim, query_num_heads, kv_num_heads, batch_size, kv_seq_len, device='cpu', dtype=torch.float32, dropout=0.1):
        super().__init__()
        self.dtype = dtype
        self.device = device

        assert query_num_heads % kv_num_heads == 0, "query heads must be divisible by kv heads"
        assert emb_dim % query_num_heads == 0, "embedding must be divisible by query heads"

        self.emb_dim = emb_dim
        self.query_num_heads = query_num_heads
        self.kv_num_heads = kv_num_heads
        self.head_dim = emb_dim // query_num_heads
        self.num_queries_per_kv = query_num_heads // kv_num_heads
        self.kv_seq_len = kv_seq_len

        self.query = nn.Linear(emb_dim, emb_dim, bias=False, dtype=dtype, device=device)
        self.key = nn.Linear(emb_dim, kv_num_heads * self.head_dim, bias=False, dtype=dtype, device=device)
        self.value = nn.Linear(emb_dim, kv_num_heads * self.head_dim, bias=False, dtype=dtype, device=device)

        self.out_proj = nn.Linear(emb_dim, emb_dim, dtype=dtype, device=device)
        self.dropout = nn.Dropout(dropout)

        # KV caches sized for the full decode horizon.
        self.register_buffer("cache_keys", torch.zeros(batch_size, kv_seq_len, kv_num_heads, self.head_dim, device=device, dtype=dtype))
        self.register_buffer("cache_value", torch.zeros(batch_size, kv_seq_len, kv_num_heads, self.head_dim, device=device, dtype=dtype))

    def forward(self, x, start_pos, RoPE=False):
        """Attend over cached positions [0, start_pos + seq_len).

        Args:
            x: (batch, seq_len, emb_dim) chunk starting at ``start_pos``.
            start_pos: absolute index of x[:, 0] in the full sequence.
            RoPE: apply rotary embeddings to queries and keys.
        """
        batch_size, seq_len, _ = x.shape

        xq = self.query(x).view(batch_size, seq_len, self.query_num_heads, self.head_dim)
        xk = self.key(x).view(batch_size, seq_len, self.kv_num_heads, self.head_dim)
        xv = self.value(x).view(batch_size, seq_len, self.kv_num_heads, self.head_dim)

        if RoPE:
            freq_q = precompute_theta_position_frequency(head_dim=self.head_dim, seq_len=seq_len, device=self.device)
            xq = apply_rotry_position_embedding(xq, freq_q, device=self.device, dtype=self.dtype)
            # NOTE(review): keys are rotated with positions starting at 0 even
            # when start_pos > 0 — confirm intended behaviour upstream.
            freq_k = precompute_theta_position_frequency(head_dim=self.head_dim, seq_len=self.kv_seq_len, device=self.device)
            xk = apply_rotry_position_embedding(xk, freq_k, device=self.device, dtype=self.dtype)

        # Write the chunk into the caches, then read the full prefix back.
        self.cache_keys[:, start_pos:start_pos+seq_len] = xk
        self.cache_value[:, start_pos:start_pos+seq_len] = xv

        xk_full = self.cache_keys[:, :start_pos+seq_len]   # [B, T, kv_heads, D]
        xv_full = self.cache_value[:, :start_pos+seq_len]

        # Transpose for attention: [B, H, T, D]
        query = xq.transpose(1, 2)      # [B, q_heads, seq_len, D]
        key = xk_full.transpose(1, 2)   # [B, kv_heads, total_kv_len, D]
        value = xv_full.transpose(1, 2)

        # Duplicate each KV head across its query-head group.
        key = key.repeat_interleave(self.num_queries_per_kv, dim=1)
        value = value.repeat_interleave(self.num_queries_per_kv, dim=1)

        attn_scores = torch.matmul(query, key.transpose(2, 3)) / (self.head_dim ** 0.5)

        # BUG FIX: the causal mask must be (seq_len, start_pos + seq_len) and
        # offset by start_pos; the old (seq_len, seq_len) triu mask failed to
        # broadcast against the scores whenever start_pos > 0.
        total_len = start_pos + seq_len
        q_pos = torch.arange(start_pos, total_len, device=self.device).unsqueeze(1)
        k_pos = torch.arange(total_len, device=self.device).unsqueeze(0)
        attn_scores.masked_fill_((k_pos > q_pos)[None, None, :, :], float('-inf'))

        attn_weights = F.softmax(attn_scores, dim=-1)
        out = torch.matmul(attn_weights, value)

        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, self.emb_dim)
        return self.dropout(self.out_proj(out))
@@ -0,0 +1,59 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import torch.nn.functional as F
4
+
5
class FF_ReLU(nn.Module):
    """Two-layer position-wise feed-forward block with a ReLU nonlinearity."""

    def __init__(self, emb_dim, hidden_dim, device='cpu', dtype=torch.float32):
        super().__init__()
        # Expand to hidden_dim, apply ReLU, project back to emb_dim.
        layers = [
            nn.Linear(emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.ReLU(),
            nn.Linear(hidden_dim, emb_dim, device=device, dtype=dtype),
        ]
        self.relu = nn.Sequential(*layers)

    def forward(self, x):
        # (..., emb_dim) -> (..., emb_dim)
        return self.relu(x)
15
+
16
class FF_LeakyReLU(nn.Module):
    """Two-layer position-wise feed-forward block with a LeakyReLU nonlinearity."""

    def __init__(self, emb_dim, hidden_dim, negative_slope=0.1, device='cpu', dtype=torch.float32):
        super().__init__()
        # Expand, apply LeakyReLU with the given negative slope, project back.
        layers = [
            nn.Linear(emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.LeakyReLU(negative_slope),
            nn.Linear(hidden_dim, emb_dim, device=device, dtype=dtype),
        ]
        self.l_relu = nn.Sequential(*layers)

    def forward(self, x):
        # (..., emb_dim) -> (..., emb_dim)
        return self.l_relu(x)
26
+
27
class FF_GELU(nn.Module):
    """Two-layer position-wise feed-forward block with a GELU nonlinearity."""

    def __init__(self, emb_dim, hidden_dim, device='cpu', dtype=torch.float32):
        super().__init__()
        # Expand to hidden_dim, apply GELU, project back to emb_dim.
        layers = [
            nn.Linear(emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.GELU(),
            nn.Linear(hidden_dim, emb_dim, device=device, dtype=dtype),
        ]
        self.gelu = nn.Sequential(*layers)

    def forward(self, x):
        # (..., emb_dim) -> (..., emb_dim)
        return self.gelu(x)
37
+
38
class FF_Sigmoid(nn.Module):
    """Two-layer position-wise feed-forward block with a Sigmoid nonlinearity."""

    def __init__(self, emb_dim, hidden_dim, device='cpu', dtype=torch.float32):
        super().__init__()
        # Expand to hidden_dim, squash with Sigmoid, project back to emb_dim.
        layers = [
            nn.Linear(emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.Sigmoid(),
            nn.Linear(hidden_dim, emb_dim, device=device, dtype=dtype),
        ]
        self.sigmoid = nn.Sequential(*layers)

    def forward(self, x):
        # (..., emb_dim) -> (..., emb_dim)
        return self.sigmoid(x)
48
+
49
class FF_SiLU(nn.Module):
    """Two-layer position-wise feed-forward block with a SiLU (swish) nonlinearity."""

    def __init__(self, emb_dim, hidden_dim, device='cpu', dtype=torch.float32):
        super().__init__()
        # Expand to hidden_dim, apply SiLU, project back to emb_dim.
        layers = [
            nn.Linear(emb_dim, hidden_dim, device=device, dtype=dtype),
            nn.SiLU(),
            nn.Linear(hidden_dim, emb_dim, device=device, dtype=dtype),
        ]
        self.silu = nn.Sequential(*layers)

    def forward(self, x):
        # (..., emb_dim) -> (..., emb_dim)
        return self.silu(x)
59
+
@@ -0,0 +1,41 @@
1
+ import torch
2
+ import torch.nn as nn
3
+
4
class LayerNorm(nn.Module):
    """Parameter-free layer normalization (no learnable scale/shift).

    NOTE(review): this class is immediately shadowed by the affine LayerNorm
    defined below with the same name, so this definition is dead code —
    confirm whether it should be removed or renamed.
    """
    def __init__(self, emb_dim, eps = 1e-5):
        super().__init__()
        # emb_dim is accepted but unused in this variant (no per-feature params).
        self.eps = eps

    def forward(self, x):
        # Normalize over the last (feature) axis using the biased variance.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return norm_x
14
+
15
class LayerNorm(nn.Module):
    """Layer normalization with learnable per-feature scale and shift."""

    def __init__(self, emb_dim, eps=1e-5, device='cpu', dtype=torch.float32):
        super().__init__()
        self.eps = eps  # numerical floor inside the sqrt
        self.weight = nn.Parameter(torch.ones(emb_dim, device=device, dtype=dtype))
        self.bias = nn.Parameter(torch.zeros(emb_dim, device=device, dtype=dtype))

    def forward(self, x):
        # Normalize over the feature (last) axis using the biased variance,
        # then apply the learned affine transform.
        centered = x - x.mean(dim=-1, keepdim=True)
        denom = torch.sqrt(x.var(dim=-1, keepdim=True, unbiased=False) + self.eps)
        return (centered / denom) * self.weight + self.bias
27
+
28
+
29
# RMS = sqrt(mean(x**2)); Norm = x / RMS
class RMSNormilization(nn.Module):
    """Root-mean-square normalization with a learnable per-feature gain.

    Note: ``eps`` is added to the RMS value itself (outside the sqrt) —
    this matches the original implementation and differs slightly from
    the sqrt(mean + eps) convention.
    """

    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # RMS over the feature (last) axis; eps keeps the division finite.
        rms = x.pow(2).mean(-1, keepdim=True).sqrt()
        return self.weight * x / (rms + self.eps)
File without changes
@@ -0,0 +1,36 @@
1
+ # problem: Random mask and global mask
2
+ import torch
3
+
4
def casual_mask(Seq_len):
    """Boolean causal mask: True marks future positions to be blocked."""
    # Strict upper triangle: entry (i, j) is True iff j > i.
    return torch.ones(Seq_len, Seq_len, dtype=torch.bool).triu(diagonal=1)
7
+
8
def sliding_window(Seq_len, window_size):
    """Causal sliding-window mask.

    Returns a boolean (Seq_len, Seq_len) tensor, True where attention must
    be blocked: future positions, or positions more than ``window_size - 1``
    steps in the past.
    """
    i = torch.arange(Seq_len).unsqueeze(1)
    j = torch.arange(Seq_len).unsqueeze(0)
    # Keep entries with 0 <= i - j < window_size.
    allowed = (j <= i) & (j > i - window_size)
    return ~allowed
12
+
13
def dilated_casual_mask(Seq_len, dilation):
    """Dilated causal mask.

    Position i may attend to position j only when j is in the past (or i
    itself) AND i - j is a multiple of ``dilation``. Returns True where
    attention must be blocked.
    """
    pos = torch.arange(Seq_len)
    q = pos.unsqueeze(1)
    k = pos.unsqueeze(0)
    # Causal and dilation conditions combined.
    visible = (q >= k) & ((q - k) % dilation == 0)
    return ~visible
19
+
20
def random_mask(Seq_len, num_random):
    """Random sparse attention mask.

    Each position i may attend to up to ``num_random`` randomly chosen
    earlier positions. Returns a boolean (Seq_len, Seq_len) tensor, True
    where attention must be blocked; row 0 (no earlier positions) is fully
    blocked.

    BUG FIX: the mask was built as a float tensor and ``~`` (bitwise not) is
    not defined for floating-point tensors, so this function always raised a
    RuntimeError. It now builds a boolean mask directly.
    """
    allowed = torch.zeros(Seq_len, Seq_len, dtype=torch.bool)
    for i in range(Seq_len):
        if i == 0:
            continue  # no earlier positions to sample from
        # Sample up to num_random distinct positions from [0, i).
        picks = torch.randperm(i)[:min(num_random, i)]
        allowed[i, picks] = True
    return ~allowed
29
+
30
def global_mask(Seq_len, global_index):
    """Global-attention mask.

    Tokens listed in ``global_index`` attend everywhere (their rows are
    unblocked) and every token may attend to them (their columns are
    unblocked). Returns a boolean (Seq_len, Seq_len) tensor, True where
    attention must be blocked.

    BUG FIX: the mask was built as a float tensor and ``~`` (bitwise not) is
    not defined for floating-point tensors, so this function always raised a
    RuntimeError. It now builds a boolean mask directly.
    """
    idx = torch.tensor(global_index, dtype=torch.long)
    allowed = torch.zeros(Seq_len, Seq_len, dtype=torch.bool)
    allowed[idx, :] = True   # global tokens see everything
    allowed[:, idx] = True   # everyone sees global tokens
    return ~allowed
@@ -0,0 +1,61 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import math
4
+
5
+ # --- Absolute Positional Embedding ---
6
class AbsolutePositionEmbedding(nn.Module):
    """Learned absolute positional embedding.

    Returns only the positional embeddings with shape
    (batch, seq_len, emb_dim); callers add them to their token embeddings.
    ``seq_len`` passed to __init__ is the maximum supported sequence length.
    """

    def __init__(self, seq_len, emb_dim):
        super().__init__()
        self.seq_len = seq_len   # maximum supported sequence length
        self.emb_dim = emb_dim
        self.embedding = nn.Embedding(seq_len, emb_dim)

    def forward(self, x):
        """x: (batch, seq_len, ...) — only its leading two dims are read."""
        batch_size, seq_len = x.shape[0], x.shape[1]
        # BUG FIX: position indices must live on the same device as the
        # embedding weights; a bare torch.arange() is always on CPU and made
        # the lookup fail once the module was moved to GPU.
        positions = torch.arange(seq_len, device=self.embedding.weight.device)
        abs_pos = self.embedding(positions)  # (seq_len, emb_dim)
        return abs_pos.unsqueeze(0).expand(batch_size, seq_len, -1).to(x.device)
18
+
19
+ # --- Sinusoidal Positional Embedding ---
20
class SinusoidalPositionalEmbedding(nn.Module):
    """Fixed sin/cos positional encodings (Vaswani et al., 2017).

    The (seq_len, emb_dim) table is precomputed once and registered as a
    buffer; forward slices the first ``seq_len`` rows and broadcasts them
    over the batch.
    """

    def __init__(self, seq_len, emb_dim):
        super().__init__()
        self.seq_len = seq_len
        self.emb_dim = emb_dim

        position = torch.arange(0, seq_len).unsqueeze(1)
        # Geometric frequency ladder: 10000**(-2i / emb_dim) per pair.
        div_term = torch.exp(torch.arange(0, emb_dim, 2) * -(math.log(10000.0) / emb_dim))

        pe = torch.zeros(seq_len, emb_dim)
        pe[:, 0::2] = torch.sin(position * div_term)  # even channels
        pe[:, 1::2] = torch.cos(position * div_term)  # odd channels

        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: (batch_size, seq_len, emb_dim) or (batch_size, seq_len);
        # only the leading two dimensions are read.
        batch_size, seq_len = x.shape[0], x.shape[1]
        table = self.pe[:seq_len]
        return table.unsqueeze(0).expand(batch_size, seq_len, -1).to(x.device)
39
+
40
+ # --- RoPE ---
41
class RoPE(nn.Module):
    """Rotary positional embedding for (batch, seq, heads, head_dim) tensors.

    Rotation factors for ``seq_len`` positions are precomputed once and
    stored as a complex buffer; forward rotates adjacent feature pairs in
    the complex plane.
    """

    def __init__(self, head_dim, seq_len, theta=10000.0, device='cpu', dtype=torch.float32):
        super().__init__()
        self.dtype = dtype
        self.device = device
        assert head_dim % 2 == 0, "head_dim must be even"
        # Per-pair inverse frequencies 1 / theta**(2i / head_dim).
        pair_idx = torch.arange(0, head_dim, 2, device=device, dtype=dtype)
        inv_freq = 1.0 / (theta ** (pair_idx / head_dim))
        positions = torch.arange(seq_len, device=device)
        angles = torch.outer(positions, inv_freq)
        # Unit-magnitude complex rotations, shape (seq_len, head_dim // 2).
        self.register_buffer("freq_complex", torch.polar(torch.ones_like(angles), angles))

    def forward(self, x):
        batch_size, seq_len, num_head, emb_dim = x.shape
        assert emb_dim % 2 == 0, "emb_dim must be even"
        # Interpret adjacent feature pairs as complex numbers.
        pairs = torch.view_as_complex(x.view(batch_size, seq_len, num_head, emb_dim // 2, 2))
        # (1, seq_len, 1, head_dim // 2) broadcasts over batch and heads.
        rot = self.freq_complex[:seq_len].unsqueeze(0).unsqueeze(2)
        rotated = pairs * rot
        out = torch.view_as_real(rotated).contiguous().view(batch_size, seq_len, num_head, emb_dim)
        return out.to(device=self.device, dtype=self.dtype)
@@ -0,0 +1,25 @@
1
+ import torch
2
+ import torch.nn as nn
3
+ import tiktoken
4
+
5
# Byte-pair-encoding (BPE) embedding helper built on a tiktoken tokenizer.
class Embedding_using_tiktoken:
    """Wrap a tiktoken encoding and turn text into embedding vectors.

    NOTE(review): ``data`` and ``embedding_dim`` are accepted by __init__ but
    never used there — only ``model`` matters. Confirm whether they should be
    stored or dropped from the signature.
    """
    def __init__(self,data,embedding_dim,model: str):
        # model: a tiktoken encoding name, e.g. "cl100k_base".
        self.tokenizer = tiktoken.get_encoding(model)


    def encoding(self,data,embedding_dim):
        """Tokenize ``data`` and embed the resulting token ids.

        NOTE(review): a fresh, randomly initialised nn.Embedding is created on
        every call, so the returned vectors are untrained and differ between
        calls — confirm this is intended (demo/bootstrap use) rather than a bug.
        """
        max_token_id = self.tokenizer.n_vocab
        embedding_layer = nn.Embedding(num_embeddings = max_token_id, embedding_dim = embedding_dim)
        tensors = torch.tensor(self.tokenizer.encode(data))
        embedded = embedding_layer(tensors)
        return embedded

    def decoding(self,data):
        """Decode a sequence of token ids back to text."""
        return self.tokenizer.decode(data)

    def vocab_size(self):
        """Return the tokenizer's vocabulary size."""
        return self.tokenizer.n_vocab

    def model_list(self):
        """Return all encoding names known to tiktoken."""
        return tiktoken.list_encoding_names()
@@ -0,0 +1,25 @@
1
+ [project]
2
+ name = "Stackformer"
3
+ version = "0.1.0"
4
+ description = "Modular transformer blocks built in PyTorch"
5
+ readme = "README.md"
6
+ requires-python = ">=3.9"
7
+ license = {text = "MIT"}
8
+
9
+ authors = [
10
+ {name = "Gurumurthy", email = "gurumurthy.00300@gmail.com"}
11
+ ]
12
+
13
+ dependencies = [
14
+ "torch>=2.6",
15
+ "tqdm>=4.67"
16
+ ]
17
+
18
+ [project.urls]
19
+ "Repository" = "https://github.com/Gurumurthy30/Stackformer"
20
+ "Issue Tracker" = "https://github.com/Gurumurthy30/Stackformer/issues"
21
+ "Discussions" = "https://github.com/Gurumurthy30/Stackformer/discussions"
22
+
23
+ [build-system]
24
+ requires = ["setuptools>=61", "wheel"]
25
+ build-backend = "setuptools.build_meta"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,31 @@
1
+ from setuptools import setup, find_packages
2
+
3
# Packaging entry point for Stackformer.
# NOTE(review): this metadata duplicates pyproject.toml ([project] table);
# keep version, dependencies and URLs in sync between the two when releasing.
setup(
    name="Stackformer",
    version="0.1.0",
    description="Modular transformer blocks built in PyTorch",
    # long_description=open("README.md", "r", encoding="utf-8").read(),
    # long_description_content_type="text/markdown",
    author="Gurumurthy",
    author_email="gurumurthy.00300@gmail.com",
    url="https://github.com/Gurumurthy30/Stackformer",
    project_urls={
        "Repository": "https://github.com/Gurumurthy30/Stackformer",
        "Issue Tracker": "https://github.com/Gurumurthy30/Stackformer/issues",
        "Discussions": "https://github.com/Gurumurthy30/Stackformer/discussions",
    },
    license="MIT",
    python_requires=">=3.9",
    # Ship everything except tests and examples.
    packages=find_packages(exclude=["tests", "examples"]),
    install_requires=[
        "torch>=2.6",
        "tqdm>=4.67",
    ],
    classifiers=[
        "Programming Language :: Python :: 3",
        "License :: OSI Approved :: MIT License",
        "Operating System :: OS Independent",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "Topic :: Software Development :: Libraries :: Python Modules",
    ],
)