npm: tech-hub-skills 1.5.1 → 1.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (197)

package/.claude/LICENSE +21 -21
package/.claude/README.md +291 -291
package/.claude/bin/cli.js +266 -266
package/.claude/bin/copilot.js +182 -182
package/.claude/bin/postinstall.js +42 -42
package/.claude/commands/README.md +336 -336
package/.claude/commands/ai-engineer.md +104 -104
package/.claude/commands/aws.md +143 -143
package/.claude/commands/azure.md +149 -149
package/.claude/commands/backend-developer.md +108 -108
package/.claude/commands/code-review.md +399 -399
package/.claude/commands/compliance-automation.md +747 -747
package/.claude/commands/compliance-officer.md +108 -108
package/.claude/commands/data-engineer.md +113 -113
package/.claude/commands/data-governance.md +102 -102
package/.claude/commands/data-scientist.md +123 -123
package/.claude/commands/database-admin.md +109 -109
package/.claude/commands/devops.md +160 -160
package/.claude/commands/docker.md +160 -160
package/.claude/commands/enterprise-dashboard.md +613 -613
package/.claude/commands/finops.md +184 -184
package/.claude/commands/frontend-developer.md +108 -108
package/.claude/commands/gcp.md +143 -143
package/.claude/commands/ml-engineer.md +115 -115
package/.claude/commands/mlops.md +187 -187
package/.claude/commands/network-engineer.md +109 -109
package/.claude/commands/optimization-advisor.md +329 -329
package/.claude/commands/orchestrator.md +623 -623
package/.claude/commands/platform-engineer.md +102 -102
package/.claude/commands/process-automation.md +226 -226
package/.claude/commands/process-changelog.md +184 -184
package/.claude/commands/process-documentation.md +484 -484
package/.claude/commands/process-kanban.md +324 -324
package/.claude/commands/process-versioning.md +214 -214
package/.claude/commands/product-designer.md +104 -104
package/.claude/commands/project-starter.md +443 -443
package/.claude/commands/qa-engineer.md +109 -109
package/.claude/commands/security-architect.md +135 -135
package/.claude/commands/sre.md +109 -109
package/.claude/commands/system-design.md +126 -126
package/.claude/commands/technical-writer.md +101 -101
package/.claude/package.json +46 -46
package/.claude/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -252
package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_ab_tester.py +356 -356
package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_template_manager.py +274 -274
package/.claude/roles/ai-engineer/skills/01-prompt-engineering/token_cost_estimator.py +324 -324
package/.claude/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -448
package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py +336 -336
package/.claude/roles/ai-engineer/skills/02-rag-pipeline/rag_pipeline.sql +213 -213
package/.claude/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -599
package/.claude/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -735
package/.claude/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -711
package/.claude/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -777
package/.claude/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -264
package/.claude/roles/azure/skills/02-data-factory/README.md +264 -264
package/.claude/roles/azure/skills/03-synapse-analytics/README.md +264 -264
package/.claude/roles/azure/skills/04-databricks/README.md +264 -264
package/.claude/roles/azure/skills/05-functions/README.md +264 -264
package/.claude/roles/azure/skills/06-kubernetes-service/README.md +264 -264
package/.claude/roles/azure/skills/07-openai-service/README.md +264 -264
package/.claude/roles/azure/skills/08-machine-learning/README.md +264 -264
package/.claude/roles/azure/skills/09-storage-adls/README.md +264 -264
package/.claude/roles/azure/skills/10-networking/README.md +264 -264
package/.claude/roles/azure/skills/11-sql-cosmos/README.md +264 -264
package/.claude/roles/azure/skills/12-event-hubs/README.md +264 -264
package/.claude/roles/code-review/skills/01-automated-code-review/README.md +394 -394
package/.claude/roles/code-review/skills/02-pr-review-workflow/README.md +427 -427
package/.claude/roles/code-review/skills/03-code-quality-gates/README.md +518 -518
package/.claude/roles/code-review/skills/04-reviewer-assignment/README.md +504 -504
package/.claude/roles/code-review/skills/05-review-analytics/README.md +540 -540
package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -550
package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/bronze_ingestion.py +337 -337
package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/medallion_queries.sql +300 -300
package/.claude/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -580
package/.claude/roles/data-engineer/skills/03-data-quality/README.md +579 -579
package/.claude/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -608
package/.claude/roles/data-engineer/skills/05-performance-optimization/README.md +547 -547
package/.claude/roles/data-governance/skills/01-data-catalog/README.md +112 -112
package/.claude/roles/data-governance/skills/02-data-lineage/README.md +129 -129
package/.claude/roles/data-governance/skills/03-data-quality-framework/README.md +182 -182
package/.claude/roles/data-governance/skills/04-access-control/README.md +39 -39
package/.claude/roles/data-governance/skills/05-master-data-management/README.md +40 -40
package/.claude/roles/data-governance/skills/06-compliance-privacy/README.md +46 -46
package/.claude/roles/data-scientist/skills/01-eda-automation/README.md +230 -230
package/.claude/roles/data-scientist/skills/01-eda-automation/eda_generator.py +446 -446
package/.claude/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -264
package/.claude/roles/data-scientist/skills/03-feature-engineering/README.md +264 -264
package/.claude/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -264
package/.claude/roles/data-scientist/skills/05-customer-analytics/README.md +264 -264
package/.claude/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -264
package/.claude/roles/data-scientist/skills/07-experimentation/README.md +264 -264
package/.claude/roles/data-scientist/skills/08-data-visualization/README.md +264 -264
package/.claude/roles/devops/skills/01-cicd-pipeline/README.md +264 -264
package/.claude/roles/devops/skills/02-container-orchestration/README.md +264 -264
package/.claude/roles/devops/skills/03-infrastructure-as-code/README.md +264 -264
package/.claude/roles/devops/skills/04-gitops/README.md +264 -264
package/.claude/roles/devops/skills/05-environment-management/README.md +264 -264
package/.claude/roles/devops/skills/06-automated-testing/README.md +264 -264
package/.claude/roles/devops/skills/07-release-management/README.md +264 -264
package/.claude/roles/devops/skills/08-monitoring-alerting/README.md +264 -264
package/.claude/roles/devops/skills/09-devsecops/README.md +265 -265
package/.claude/roles/finops/skills/01-cost-visibility/README.md +264 -264
package/.claude/roles/finops/skills/02-resource-tagging/README.md +264 -264
package/.claude/roles/finops/skills/03-budget-management/README.md +264 -264
package/.claude/roles/finops/skills/04-reserved-instances/README.md +264 -264
package/.claude/roles/finops/skills/05-spot-optimization/README.md +264 -264
package/.claude/roles/finops/skills/06-storage-tiering/README.md +264 -264
package/.claude/roles/finops/skills/07-compute-rightsizing/README.md +264 -264
package/.claude/roles/finops/skills/08-chargeback/README.md +264 -264
package/.claude/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -566
package/.claude/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -655
package/.claude/roles/ml-engineer/skills/03-model-training/README.md +704 -704
package/.claude/roles/ml-engineer/skills/04-model-serving/README.md +845 -845
package/.claude/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -874
package/.claude/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -264
package/.claude/roles/mlops/skills/02-experiment-tracking/README.md +264 -264
package/.claude/roles/mlops/skills/03-model-registry/README.md +264 -264
package/.claude/roles/mlops/skills/04-feature-store/README.md +264 -264
package/.claude/roles/mlops/skills/05-model-deployment/README.md +264 -264
package/.claude/roles/mlops/skills/06-model-observability/README.md +264 -264
package/.claude/roles/mlops/skills/07-data-versioning/README.md +264 -264
package/.claude/roles/mlops/skills/08-ab-testing/README.md +264 -264
package/.claude/roles/mlops/skills/09-automated-retraining/README.md +264 -264
package/.claude/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -153
package/.claude/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -57
package/.claude/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -59
package/.claude/roles/platform-engineer/skills/04-developer-experience/README.md +57 -57
package/.claude/roles/platform-engineer/skills/05-incident-management/README.md +73 -73
package/.claude/roles/platform-engineer/skills/06-capacity-management/README.md +59 -59
package/.claude/roles/product-designer/skills/01-requirements-discovery/README.md +407 -407
package/.claude/roles/product-designer/skills/02-user-research/README.md +382 -382
package/.claude/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -437
package/.claude/roles/product-designer/skills/04-ux-design/README.md +496 -496
package/.claude/roles/product-designer/skills/05-product-market-fit/README.md +376 -376
package/.claude/roles/product-designer/skills/06-stakeholder-management/README.md +412 -412
package/.claude/roles/security-architect/skills/01-pii-detection/README.md +319 -319
package/.claude/roles/security-architect/skills/02-threat-modeling/README.md +264 -264
package/.claude/roles/security-architect/skills/03-infrastructure-security/README.md +264 -264
package/.claude/roles/security-architect/skills/04-iam/README.md +264 -264
package/.claude/roles/security-architect/skills/05-application-security/README.md +264 -264
package/.claude/roles/security-architect/skills/06-secrets-management/README.md +264 -264
package/.claude/roles/security-architect/skills/07-security-monitoring/README.md +264 -264
package/.claude/roles/system-design/skills/01-architecture-patterns/README.md +337 -337
package/.claude/roles/system-design/skills/02-requirements-engineering/README.md +264 -264
package/.claude/roles/system-design/skills/03-scalability/README.md +264 -264
package/.claude/roles/system-design/skills/04-high-availability/README.md +264 -264
package/.claude/roles/system-design/skills/05-cost-optimization-design/README.md +264 -264
package/.claude/roles/system-design/skills/06-api-design/README.md +264 -264
package/.claude/roles/system-design/skills/07-observability-architecture/README.md +264 -264
package/.claude/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -336
package/.claude/roles/system-design/skills/08-process-automation/README.md +521 -521
package/.claude/roles/system-design/skills/08-process-automation/ai_prompt_generator.py +744 -744
package/.claude/roles/system-design/skills/08-process-automation/automation_recommender.py +688 -688
package/.claude/roles/system-design/skills/08-process-automation/plan_generator.py +679 -679
package/.claude/roles/system-design/skills/08-process-automation/process_analyzer.py +528 -528
package/.claude/roles/system-design/skills/08-process-automation/process_parser.py +684 -684
package/.claude/roles/system-design/skills/08-process-automation/role_matcher.py +615 -615
package/.claude/skills/README.md +336 -336
package/.claude/skills/ai-engineer.md +104 -104
package/.claude/skills/aws.md +143 -143
package/.claude/skills/azure.md +149 -149
package/.claude/skills/backend-developer.md +108 -108
package/.claude/skills/code-review.md +399 -399
package/.claude/skills/compliance-automation.md +747 -747
package/.claude/skills/compliance-officer.md +108 -108
package/.claude/skills/data-engineer.md +113 -113
package/.claude/skills/data-governance.md +102 -102
package/.claude/skills/data-scientist.md +123 -123
package/.claude/skills/database-admin.md +109 -109
package/.claude/skills/devops.md +160 -160
package/.claude/skills/docker.md +160 -160
package/.claude/skills/enterprise-dashboard.md +613 -613
package/.claude/skills/finops.md +184 -184
package/.claude/skills/frontend-developer.md +108 -108
package/.claude/skills/gcp.md +143 -143
package/.claude/skills/ml-engineer.md +115 -115
package/.claude/skills/mlops.md +187 -187
package/.claude/skills/network-engineer.md +109 -109
package/.claude/skills/optimization-advisor.md +329 -329
package/.claude/skills/orchestrator.md +623 -623
package/.claude/skills/platform-engineer.md +102 -102
package/.claude/skills/process-automation.md +226 -226
package/.claude/skills/process-changelog.md +184 -184
package/.claude/skills/process-documentation.md +484 -484
package/.claude/skills/process-kanban.md +324 -324
package/.claude/skills/process-versioning.md +214 -214
package/.claude/skills/product-designer.md +104 -104
package/.claude/skills/project-starter.md +443 -443
package/.claude/skills/qa-engineer.md +109 -109
package/.claude/skills/security-architect.md +135 -135
package/.claude/skills/sre.md +109 -109
package/.claude/skills/system-design.md +126 -126
package/.claude/skills/technical-writer.md +101 -101
package/.gitattributes +2 -2
package/GITHUB_COPILOT.md +106 -106
package/README.md +192 -184
package/package.json +16 -8
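
Almost every entry above reports equal added and removed counts, and the hunk shown for document_chunker.py removes and re-adds lines whose visible text is identical. One possible explanation, which the listing itself does not confirm, is a line-ending or other whitespace-only rewrite (package/.gitattributes also changed in this release). A minimal sketch for checking that hypothesis on two extracted copies of a file follows; the helper names `normalize_newlines` and `differs_only_in_line_endings` are hypothetical and are not part of the package.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def differs_only_in_line_endings(old: bytes, new: bytes) -> bool:
    """Return True when the inputs differ, but only in their line endings."""
    return old != new and normalize_newlines(old) == normalize_newlines(new)


if __name__ == "__main__":
    # Stand-in contents; in practice, read both versions with open(path, "rb").
    old_version = b"import re\r\nfrom enum import Enum\r\n"
    new_version = b"import re\nfrom enum import Enum\n"
    print(differs_only_in_line_endings(old_version, new_version))
```

Reading the files in binary mode matters here: Python's text mode translates line endings on read by default, which would hide exactly the difference being tested.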

package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py CHANGED

@@ -1,336 +1,336 @@
-"""
-Advanced Document Chunking for RAG Systems
-Supports semantic, recursive, and fixed-size chunking strategies.
-"""
-from typing import List, Dict, Any, Optional
-from dataclasses import dataclass
-from enum import Enum
-import re
-from langchain.text_splitter import (
-    RecursiveCharacterTextSplitter,
-    CharacterTextSplitter,
-    TokenTextSplitter
-)
-class ChunkStrategy(Enum):
-    """Available chunking strategies."""
-    FIXED = "fixed"  # Fixed character/token size
-    SEMANTIC = "semantic"  # Semantic boundaries (paragraphs, sentences)
-    RECURSIVE = "recursive"  # Recursive splitting with multiple separators
-    SLIDING_WINDOW = "sliding_window"  # Overlapping windows
-@dataclass
-class Chunk:
-    """A document chunk with metadata."""
-    content: str
-    chunk_id: str
-    document_id: str
-    chunk_index: int
-    metadata: Dict[str, Any]
-    char_count: int
-    token_count: Optional[int] = None
-    def __post_init__(self):
-        if self.char_count == 0:
-            self.char_count = len(self.content)
-class DocumentChunker:
-    """Advanced document chunker with multiple strategies."""
-    def __init__(
-        self,
-        strategy: ChunkStrategy = ChunkStrategy.RECURSIVE,
-        chunk_size: int = 1000,
-        chunk_overlap: int = 200,
-        separators: Optional[List[str]] = None
-    ):
-        """
-        Initialize document chunker.
-        Args:
-            strategy: Chunking strategy to use
-            chunk_size: Target chunk size (characters or tokens)
-            chunk_overlap: Overlap between chunks
-            separators: Custom separators for recursive splitting
-        """
-        self.strategy = strategy
-        self.chunk_size = chunk_size
-        self.chunk_overlap = chunk_overlap
-        self.separators = separators or ["\n\n", "\n", ". ", " ", ""]
-        self._init_splitter()
-    def _init_splitter(self):
-        """Initialize the appropriate text splitter."""
-        if self.strategy == ChunkStrategy.RECURSIVE:
-            self.splitter = RecursiveCharacterTextSplitter(
-                chunk_size=self.chunk_size,
-                chunk_overlap=self.chunk_overlap,
-                separators=self.separators,
-                length_function=len
-            )
-        elif self.strategy == ChunkStrategy.FIXED:
-            self.splitter = CharacterTextSplitter(
-                chunk_size=self.chunk_size,
-                chunk_overlap=self.chunk_overlap,
-                separator="\n"
-            )
-        elif self.strategy == ChunkStrategy.SEMANTIC:
-            # For semantic chunking, we'll use custom logic
-            self.splitter = None
-        else:
-            self.splitter = RecursiveCharacterTextSplitter(
-                chunk_size=self.chunk_size,
-                chunk_overlap=self.chunk_overlap
-            )
-    def chunk_document(
-        self,
-        text: str,
-        document_id: str,
-        metadata: Optional[Dict[str, Any]] = None
-    ) -> List[Chunk]:
-        """
-        Chunk a document into smaller pieces.
-        Args:
-            text: Document text
-            document_id: Unique document identifier
-            metadata: Additional metadata
-        Returns:
-            List of Chunk objects
-        """
-        metadata = metadata or {}
-        if self.strategy == ChunkStrategy.SEMANTIC:
-            text_chunks = self._semantic_chunking(text)
-        elif self.strategy == ChunkStrategy.SLIDING_WINDOW:
-            text_chunks = self._sliding_window_chunking(text)
-        else:
-            text_chunks = self.splitter.split_text(text)
-        chunks = []
-        for idx, chunk_text in enumerate(text_chunks):
-            chunk = Chunk(
-                content=chunk_text,
-                chunk_id=f"{document_id}_chunk_{idx}",
-                document_id=document_id,
-                chunk_index=idx,
-                metadata={**metadata, "strategy": self.strategy.value},
-                char_count=len(chunk_text)
-            )
-            chunks.append(chunk)
-        return chunks
-    def _semantic_chunking(self, text: str) -> List[str]:
-        """
-        Chunk by semantic boundaries (paragraphs with context).
-        This strategy:
-        1. Splits on paragraph boundaries
-        2. Combines small paragraphs
-        3. Ensures chunks don't exceed max size
-        """
-        # Split into paragraphs
-        paragraphs = re.split(r'\n\s*\n', text)
-        chunks = []
-        current_chunk = []
-        current_length = 0
-        for para in paragraphs:
-            para = para.strip()
-            if not para:
-                continue
-            para_length = len(para)
-            # If paragraph alone exceeds chunk size, split it
-            if para_length > self.chunk_size:
-                # Save current chunk if exists
-                if current_chunk:
-                    chunks.append("\n\n".join(current_chunk))
-                    current_chunk = []
-                    current_length = 0
-                # Split large paragraph
-                sentences = re.split(r'(?<=[.!?])\s+', para)
-                temp_chunk = []
-                temp_length = 0
-                for sentence in sentences:
-                    sent_length = len(sentence)
-                    if temp_length + sent_length > self.chunk_size:
-                        if temp_chunk:
-                            chunks.append(" ".join(temp_chunk))
-                        temp_chunk = [sentence]
-                        temp_length = sent_length
-                    else:
-                        temp_chunk.append(sentence)
-                        temp_length += sent_length + 1
-                if temp_chunk:
-                    chunks.append(" ".join(temp_chunk))
-            # If adding paragraph exceeds chunk size, save current chunk
-            elif current_length + para_length > self.chunk_size:
-                if current_chunk:
-                    chunks.append("\n\n".join(current_chunk))
-                current_chunk = [para]
-                current_length = para_length
-            # Otherwise, add to current chunk
-            else:
-                current_chunk.append(para)
-                current_length += para_length + 2  # +2 for \n\n
-        # Add remaining chunk
-        if current_chunk:
-            chunks.append("\n\n".join(current_chunk))
-        return chunks
-    def _sliding_window_chunking(self, text: str) -> List[str]:
-        """
-        Create overlapping chunks with sliding window.
-        Useful for ensuring important content at chunk boundaries isn't lost.
-        """
-        chunks = []
-        start = 0
-        while start < len(text):
-            end = start + self.chunk_size
-            chunk = text[start:end]
-            # Try to end at sentence boundary
-            if end < len(text):
-                last_period = chunk.rfind('. ')
-                if last_period > self.chunk_size // 2:
-                    chunk = chunk[:last_period + 1]
-                    end = start + last_period + 1
-            chunks.append(chunk.strip())
-            # Move start forward (with overlap)
-            start = end - self.chunk_overlap
-        return chunks
-    def chunk_multiple_documents(
-        self,
-        documents: List[Dict[str, Any]]
-    ) -> List[Chunk]:
-        """
-        Chunk multiple documents.
-        Args:
-            documents: List of dicts with 'id', 'text', and optional 'metadata'
-        Returns:
-            List of all chunks
-        """
-        all_chunks = []
-        for doc in documents:
-            chunks = self.chunk_document(
-                text=doc['text'],
-                document_id=doc['id'],
-                metadata=doc.get('metadata', {})
-            )
-            all_chunks.extend(chunks)
-        return all_chunks
-    def get_chunk_statistics(self, chunks: List[Chunk]) -> Dict[str, Any]:
-        """Get statistics about chunks."""
-        if not chunks:
-            return {}
-        char_counts = [c.char_count for c in chunks]
-        return {
-            "total_chunks": len(chunks),
-            "total_characters": sum(char_counts),
-            "avg_chunk_size": sum(char_counts) / len(chunks),
-            "min_chunk_size": min(char_counts),
-            "max_chunk_size": max(char_counts),
-            "unique_documents": len(set(c.document_id for c in chunks)),
-            "strategy": self.strategy.value
-        }
-# Example usage
-if __name__ == "__main__":
-    # Sample document
-    sample_doc = """
-    Marketing Campaign Analysis Best Practices
-    Effective marketing campaign analysis requires a systematic approach to data collection and interpretation.
-    Data Collection
-    First, ensure you're tracking the right metrics. Common KPIs include impression count, click-through rates (CTR), conversion rates, and return on ad spend (ROAS). Use tracking pixels and UTM parameters to accurately attribute conversions.
-    Campaign Segmentation
-    Break down your analysis by campaign type, channel, audience segment, and time period. This granular view helps identify what's working and what isn't. For example, email campaigns might perform better with certain demographics, while social media ads resonate with others.
-    Performance Benchmarking
-    Compare your results against industry benchmarks and historical data. A 2% CTR might seem low in isolation, but could be excellent for your industry. Track performance over time to identify trends and seasonality.
-    Attribution Modeling
-    Understand the customer journey. Did they convert after the first touchpoint or after multiple interactions? Multi-touch attribution helps allocate credit appropriately across channels.
-    A/B Testing
-    Never stop testing. Test subject lines, ad copy, images, calls-to-action, and landing pages. Use statistical significance testing to ensure your results are valid.
-    Reporting and Insights
-    Create actionable reports that tell a story. Don't just show numbers—explain what they mean and what actions should be taken. Use visualizations to make data accessible.
-    Continuous Optimization
-    Marketing is iterative. Use insights from each campaign to improve the next one. Build a knowledge base of what works for your audience.
-    """
-    print("=" * 80)
-    print("Document Chunking Demonstrations")
-    print("=" * 80)
-    # Test different chunking strategies
-    strategies = [
-        (ChunkStrategy.RECURSIVE, "Recursive (smart boundaries)"),
-        (ChunkStrategy.SEMANTIC, "Semantic (paragraph-based)"),
-        (ChunkStrategy.SLIDING_WINDOW, "Sliding Window (overlapping)"),
-        (ChunkStrategy.FIXED, "Fixed Size")
-    ]
-    for strategy, description in strategies:
-        print(f"\n📄 Strategy: {description}")
-        print("-" * 80)
-        chunker = DocumentChunker(
-            strategy=strategy,
-            chunk_size=300,
-            chunk_overlap=50
-        )
-        chunks = chunker.chunk_document(
-            text=sample_doc,
-            document_id="campaign_analysis_guide",
-            metadata={"category": "marketing", "author": "Tech Hub"}
-        )
-        stats = chunker.get_chunk_statistics(chunks)
-        print(f"Total chunks: {stats['total_chunks']}")
-        print(f"Avg chunk size: {stats['avg_chunk_size']:.0f} chars")
-        print(f"Size range: {stats['min_chunk_size']}-{stats['max_chunk_size']} chars")
-        print(f"\nFirst chunk preview:")
-        print(f"{chunks[0].content[:200]}...")
-        print(f"\nChunk IDs: {[c.chunk_id for c in chunks]}")
+"""
+Advanced Document Chunking for RAG Systems
+Supports semantic, recursive, and fixed-size chunking strategies.
+"""
+from typing import List, Dict, Any, Optional
+from dataclasses import dataclass
+from enum import Enum
+import re
+from langchain.text_splitter import (
+    RecursiveCharacterTextSplitter,
+    CharacterTextSplitter,
+    TokenTextSplitter
+)
+class ChunkStrategy(Enum):
+    """Available chunking strategies."""
+    FIXED = "fixed"  # Fixed character/token size
+    SEMANTIC = "semantic"  # Semantic boundaries (paragraphs, sentences)
+    RECURSIVE = "recursive"  # Recursive splitting with multiple separators
+    SLIDING_WINDOW = "sliding_window"  # Overlapping windows
+@dataclass
+class Chunk:
+    """A document chunk with metadata."""
+    content: str
+    chunk_id: str
+    document_id: str
+    chunk_index: int
+    metadata: Dict[str, Any]
+    char_count: int
+    token_count: Optional[int] = None
+    def __post_init__(self):
+        if self.char_count == 0:
+            self.char_count = len(self.content)
+class DocumentChunker:
+    """Advanced document chunker with multiple strategies."""
+    def __init__(
+        self,
+        strategy: ChunkStrategy = ChunkStrategy.RECURSIVE,
+        chunk_size: int = 1000,
+        chunk_overlap: int = 200,
+        separators: Optional[List[str]] = None
+    ):
+        """
+        Initialize document chunker.
+        Args:
+            strategy: Chunking strategy to use
+            chunk_size: Target chunk size (characters or tokens)
+            chunk_overlap: Overlap between chunks
+            separators: Custom separators for recursive splitting
+        """
+        self.strategy = strategy
+        self.chunk_size = chunk_size
+        self.chunk_overlap = chunk_overlap
+        self.separators = separators or ["\n\n", "\n", ". ", " ", ""]
+        self._init_splitter()
+    def _init_splitter(self):
+        """Initialize the appropriate text splitter."""
+        if self.strategy == ChunkStrategy.RECURSIVE:
+            self.splitter = RecursiveCharacterTextSplitter(
+                chunk_size=self.chunk_size,
+                chunk_overlap=self.chunk_overlap,
+                separators=self.separators,
+                length_function=len
+            )
+        elif self.strategy == ChunkStrategy.FIXED:
+            self.splitter = CharacterTextSplitter(
+                chunk_size=self.chunk_size,
+                chunk_overlap=self.chunk_overlap,
+                separator="\n"
+            )
+        elif self.strategy == ChunkStrategy.SEMANTIC:
+            # For semantic chunking, we'll use custom logic
+            self.splitter = None
+        else:
+            self.splitter = RecursiveCharacterTextSplitter(
+                chunk_size=self.chunk_size,
+                chunk_overlap=self.chunk_overlap
+            )
+    def chunk_document(
+        self,
+        text: str,
+        document_id: str,
+        metadata: Optional[Dict[str, Any]] = None
+    ) -> List[Chunk]:
+        """
+        Chunk a document into smaller pieces.
+        Args:
+            text: Document text
+            document_id: Unique document identifier
+            metadata: Additional metadata
+        Returns:
+            List of Chunk objects
+        """
+        metadata = metadata or {}
+        if self.strategy == ChunkStrategy.SEMANTIC:
+            text_chunks = self._semantic_chunking(text)
+        elif self.strategy == ChunkStrategy.SLIDING_WINDOW:
+            text_chunks = self._sliding_window_chunking(text)
+        else:
+            text_chunks = self.splitter.split_text(text)
+        chunks = []
+        for idx, chunk_text in enumerate(text_chunks):
+            chunk = Chunk(
+                content=chunk_text,
+                chunk_id=f"{document_id}_chunk_{idx}",
+                document_id=document_id,
+                chunk_index=idx,
+                metadata={**metadata, "strategy": self.strategy.value},
+                char_count=len(chunk_text)
+            )
+            chunks.append(chunk)
+        return chunks
+    def _semantic_chunking(self, text: str) -> List[str]:
+        """
+        Chunk by semantic boundaries (paragraphs with context).
+        This strategy:
+        1. Splits on paragraph boundaries
+        2. Combines small paragraphs
+        3. Ensures chunks don't exceed max size
+        """
+        # Split into paragraphs
+        paragraphs = re.split(r'\n\s*\n', text)
+        chunks = []
+        current_chunk = []
+        current_length = 0
+        for para in paragraphs:
+            para = para.strip()
+            if not para:
+                continue
+            para_length = len(para)
+            # If paragraph alone exceeds chunk size, split it
+            if para_length > self.chunk_size:
+                # Save current chunk if exists
+                if current_chunk:
+                    chunks.append("\n\n".join(current_chunk))
+                    current_chunk = []
+                    current_length = 0
+                # Split large paragraph
+                sentences = re.split(r'(?<=[.!?])\s+', para)
+                temp_chunk = []
+                temp_length = 0
+                for sentence in sentences:
+                    sent_length = len(sentence)
+                    if temp_length + sent_length > self.chunk_size:
+                        if temp_chunk:
+                            chunks.append(" ".join(temp_chunk))
+                        temp_chunk = [sentence]
+                        temp_length = sent_length
+                    else:
+                        temp_chunk.append(sentence)
+                        temp_length += sent_length + 1
+                if temp_chunk:
+                    chunks.append(" ".join(temp_chunk))
+            # If adding paragraph exceeds chunk size, save current chunk
+            elif current_length + para_length > self.chunk_size:
+                if current_chunk:
+                    chunks.append("\n\n".join(current_chunk))
+                current_chunk = [para]
+                current_length = para_length
+            # Otherwise, add to current chunk
+            else:
+                current_chunk.append(para)
+                current_length += para_length + 2  # +2 for \n\n
+        # Add remaining chunk
+        if current_chunk:
+            chunks.append("\n\n".join(current_chunk))
+        return chunks
+    def _sliding_window_chunking(self, text: str) -> List[str]:
+        """
+        Create overlapping chunks with sliding window.
+        Useful for ensuring important content at chunk boundaries isn't lost.
+        """
+        chunks = []
+        start = 0
+        while start < len(text):
+            end = start + self.chunk_size
+            chunk = text[start:end]
+            # Try to end at sentence boundary
+            if end < len(text):
+                last_period = chunk.rfind('. ')
+                if last_period > self.chunk_size // 2:
+                    chunk = chunk[:last_period + 1]
+                    end = start + last_period + 1
+            chunks.append(chunk.strip())
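+            # Stop once the window reaches the end of the text; otherwise the
+            # overlap step re-emits a tail that the last chunk already covers
+            if end >= len(text):
+                break
+            # Move start forward (with overlap); always advance at least one
+            # character so a large chunk_overlap cannot stall the loop
+            start = max(end - self.chunk_overlap, start + 1)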
+        return chunks
+    def chunk_multiple_documents(
+        self,
+        documents: List[Dict[str, Any]]
+    ) -> List[Chunk]:
+        """
+        Chunk multiple documents.
+        Args:
+            documents: List of dicts with 'id', 'text', and optional 'metadata'
+        Returns:
+            List of all chunks
+        """
+        all_chunks = []
+        for doc in documents:
+            chunks = self.chunk_document(
+                text=doc['text'],
+                document_id=doc['id'],
+                metadata=doc.get('metadata', {})
+            )
+            all_chunks.extend(chunks)
+        return all_chunks
+    def get_chunk_statistics(self, chunks: List[Chunk]) -> Dict[str, Any]:
+        """Get statistics about chunks."""
+        if not chunks:
+            return {}
+        char_counts = [c.char_count for c in chunks]
+        return {
+            "total_chunks": len(chunks),
+            "total_characters": sum(char_counts),
+            "avg_chunk_size": sum(char_counts) / len(chunks),
+            "min_chunk_size": min(char_counts),
+            "max_chunk_size": max(char_counts),
+            "unique_documents": len(set(c.document_id for c in chunks)),
+            "strategy": self.strategy.value
+        }
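The paragraph-packing step of the semantic strategy above can be tried on its own. The following is a minimal sketch, not the package's implementation: `pack_paragraphs` and `max_chars` are hypothetical names introduced here, and the sketch assumes an oversized paragraph is simply emitted as its own chunk rather than split into sentences.

```python
import re
from typing import List


def pack_paragraphs(text: str, max_chars: int) -> List[str]:
    """Hypothetical helper: greedily pack paragraphs into chunks.

    A paragraph longer than max_chars becomes its own chunk (no sentence
    splitting in this sketch).
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0  # length of "\n\n".join(current)

    for para in paragraphs:
        # Joining adds a 2-character separator unless the chunk is empty.
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            # The paragraph does not fit: flush and start a new chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
            added = len(para)
        current.append(para)
        current_len += added

    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = "aaaa\n\nbbbb\n\ncccc"
    for chunk in pack_paragraphs(sample, max_chars=10):
        print(repr(chunk))
```

Tracking the joined length, separators included, keeps the size check exact, which is the property the `+ 2` bookkeeping in `_semantic_chunking` is aiming for.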
+# Example usage
+if __name__ == "__main__":
+    # Sample document
+    sample_doc = """
+    Marketing Campaign Analysis Best Practices
+    Effective marketing campaign analysis requires a systematic approach to data collection and interpretation.
+    Data Collection
+    First, ensure you're tracking the right metrics. Common KPIs include impression count, click-through rates (CTR), conversion rates, and return on ad spend (ROAS). Use tracking pixels and UTM parameters to accurately attribute conversions.
+    Campaign Segmentation
+    Break down your analysis by campaign type, channel, audience segment, and time period. This granular view helps identify what's working and what isn't. For example, email campaigns might perform better with certain demographics, while social media ads resonate with others.
+    Performance Benchmarking
+    Compare your results against industry benchmarks and historical data. A 2% CTR might seem low in isolation, but could be excellent for your industry. Track performance over time to identify trends and seasonality.
+    Attribution Modeling
+    Understand the customer journey. Did they convert after the first touchpoint or after multiple interactions? Multi-touch attribution helps allocate credit appropriately across channels.
+    A/B Testing
+    Never stop testing. Test subject lines, ad copy, images, calls-to-action, and landing pages. Use statistical significance testing to ensure your results are valid.
+    Reporting and Insights
+    Create actionable reports that tell a story. Don't just show numbers—explain what they mean and what actions should be taken. Use visualizations to make data accessible.
+    Continuous Optimization
+    Marketing is iterative. Use insights from each campaign to improve the next one. Build a knowledge base of what works for your audience.
+    """
+    print("=" * 80)
+    print("Document Chunking Demonstrations")
+    print("=" * 80)
+    # Test different chunking strategies
+    strategies = [
+        (ChunkStrategy.RECURSIVE, "Recursive (smart boundaries)"),
+        (ChunkStrategy.SEMANTIC, "Semantic (paragraph-based)"),
+        (ChunkStrategy.SLIDING_WINDOW, "Sliding Window (overlapping)"),
+        (ChunkStrategy.FIXED, "Fixed Size")
+    ]
+    for strategy, description in strategies:
+        print(f"\n📄 Strategy: {description}")
+        print("-" * 80)
+        chunker = DocumentChunker(
+            strategy=strategy,
+            chunk_size=300,
+            chunk_overlap=50
+        )
+        chunks = chunker.chunk_document(
+            text=sample_doc,
+            document_id="campaign_analysis_guide",
+            metadata={"category": "marketing", "author": "Tech Hub"}
+        )
+        stats = chunker.get_chunk_statistics(chunks)
+        print(f"Total chunks: {stats['total_chunks']}")
+        print(f"Avg chunk size: {stats['avg_chunk_size']:.0f} chars")
+        print(f"Size range: {stats['min_chunk_size']}-{stats['max_chunk_size']} chars")
+        print("\nFirst chunk preview:")
+        print(f"{chunks[0].content[:200]}...")
+        print(f"\nChunk IDs: {[c.chunk_id for c in chunks]}")
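The demo above runs the sliding-window strategy with `chunk_size=300` and `chunk_overlap=50`. To see how the overlap behaves on a small input, here is a standalone sketch under stated assumptions: `sliding_window_chunks` is a hypothetical free function (not part of the package), it rejects an overlap that is not smaller than the chunk size, and it stops as soon as a window reaches the end of the text.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Hypothetical standalone version of an overlapping sliding window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut just after the last ". " in the second half
            # of the window, mirroring the sentence-boundary heuristic.
            last_period = text.rfind(". ", start, end)
            if last_period - start > chunk_size // 2:
                end = last_period + 1
        piece = text[start:end].strip()
        if piece:
            chunks.append(piece)
        if end >= len(text):
            break  # the final window already reached the end of the text
        # Step back by `overlap`, but always advance by at least one character.
        start = max(end - overlap, start + 1)
    return chunks


if __name__ == "__main__":
    demo = "abcdefghij" * 3  # 30 characters, no sentence boundaries
    for piece in sliding_window_chunks(demo, chunk_size=10, overlap=3):
        print(repr(piece))
```

With no sentence boundaries in the input, each chunk begins with the last three characters of the previous one, which makes the overlap easy to verify by eye.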