TeLLMgramBot (PyPI) 3.15.0.tar.gz → 3.15.2.tar.gz

Files changed (26)

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: TeLLMgramBot
-Version: 3.15.0
+Version: 3.15.2
 Summary: LLM-powered Telegram bot (OpenAI + Anthropic)
 Home-page: https://github.com/Digital-Heresy/TeLLMgramBot
 Author: Digital Heresy
@@ -22,6 +22,8 @@ Requires-Dist: tzdata>=2025.2
 Requires-Dist: pypdf>=6.0
 Requires-Dist: defusedxml>=0.7
 Requires-Dist: charset-normalizer>=3.0
+Requires-Dist: python-docx>=1.2
+Requires-Dist: openpyxl>=3.1
 Dynamic: author
 Dynamic: author-email
 Dynamic: description
@@ -45,7 +47,7 @@ The basic goal of this project is to create a bridge between a Telegram Bot and
   * Example: "What do you think of this article? [https://some_site/article]"
   * Uses a separate model (configurable via `url_model`) to handle larger URL content.
 * Share documents and text files for analysis and summarisation.
-  * Supported formats: PDF, plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML.
+  * Supported formats: PDF (via pypdf), Microsoft Office documents (.docx via python-docx, .xlsx via openpyxl), plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML (via defusedxml).
   * The bot extracts and summarises content, with automatic encoding detection for non-UTF-8 files. Files over 20 MB are rejected.
   * Can be disabled via `document_processing: false` in config.
 * Ask questions about message history across all your chats using natural language; the bot will search, attribute messages to speakers, and include messages from other bots.

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/README.md RENAMED

@@ -10,7 +10,7 @@ The basic goal of this project is to create a bridge between a Telegram Bot and
   * Example: "What do you think of this article? [https://some_site/article]"
   * Uses a separate model (configurable via `url_model`) to handle larger URL content.
 * Share documents and text files for analysis and summarisation.
-  * Supported formats: PDF, plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML.
+  * Supported formats: PDF (via pypdf), Microsoft Office documents (.docx via python-docx, .xlsx via openpyxl), plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML (via defusedxml).
   * The bot extracts and summarises content, with automatic encoding detection for non-UTF-8 files. Files over 20 MB are rejected.
   * Can be disabled via `document_processing: false` in config.
 * Ask questions about message history across all your chats using natural language; the bot will search, attribute messages to speakers, and include messages from other bots.

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/TeLLMgramBot.py RENAMED

@@ -63,6 +63,22 @@ _MSG_FORGET_PROMPT      = "Do you really want me to forget our memories together
 _MSG_FORGET_COMPLETE    = "Forget complete. Fresh start it is..."
 _MSG_FORGET_CANCELLED   = "Forget cancelled. Glad you changed your mind!"
+def _validated_allow_local(value) -> bool:
+    """
+    Strictly validate the allow_local_webhooks config value, which gates an SSRF guard.
+    Only a literal `True` enables it; any other truthy non-bool (e.g. a quoted "false"
+    string) must not, so logs a warning and defaults to False.
+    Args:
+        value: Raw `allow_local_webhooks` value from bot config.
+    """
+    if value is not None and not isinstance(value, bool):
+        logger.warning(f"Invalid allow_local_webhooks '{value}' (must be true/false); defaulting to false")
+    return value is True
 _SEARCH_TOOL = {
     "name": "search_messages",
     "description": (
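
The strict check in `_validated_allow_local()` matters because a quoted "false" in YAML loads as a non-empty string, which is truthy. A minimal standalone sketch of the same idea follows; the name `strict_bool` and the logger name are hypothetical and not part of the package.

```python
import logging

logger = logging.getLogger("strict_flag_demo")  # hypothetical logger name


def strict_bool(value, name: str = "allow_local_webhooks") -> bool:
    """Return True only for the literal bool True; warn on any other non-None, non-bool value."""
    if value is not None and not isinstance(value, bool):
        logger.warning("Invalid %s %r (must be true/false); defaulting to false", name, value)
    return value is True


# The old `value or False` pattern passes a truthy string straight through.
print(bool("false" or False))  # True
# The strict check rejects it.
print(strict_bool("false"))    # False
```

Integers such as `1` are rejected as well, since only the `True` singleton satisfies `value is True`.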
@@ -926,17 +942,16 @@ class TelegramBot:
         """
         Route Telegram document messages through the document summarisation pipeline.
-        Group trigger conditions (caption @mention, nickname/initials match, or reply-to-bot)
-        are resolved via the shared _resolve_group_trigger() also used by tele_handle_message,
-        including the exclusive-foreign-mention yield on reply-to-bot threads. Silently ignores
-        documents in channels, edited messages, and in groups/supergroups where no trigger
-        condition matched. Once triggered, respects the same global online/offline gate as
-        tele_handle_response() (set via /start, /stop) - replies with the offline message
-        rather than processing while offline. When document_processing is disabled in config,
-        replies with a friendly message instead of processing - but only when the message was
-        otherwise triggered (private chat, or a matched group trigger); untriggered group
-        documents still yield silently regardless of the flag. Files over 20 MB receive a
-        friendly error before download.
+        Gates checked in order:
+          trigger    - shared _resolve_group_trigger() (mention/nickname/initials/reply-to-bot,
+                       incl. exclusive-foreign-mention yield); silent in channels, edited
+                       messages, and untriggered group messages.
+          online     - same self._online gate as tele_handle_response() (/start, /stop);
+                       offline reply if down.
+          processing - document_processing config flag; friendly reply if disabled, but
+                       only once triggered (untriggered group documents stay silent
+                       regardless of the flag).
+          file size  - friendly error over 20 MB, checked before download.
         The user message stored in DB is '[Document: filename] caption'; document
         bytes are never persisted. Respects is_private for cross-chat context isolation.
@@ -1228,6 +1243,39 @@ class TelegramBot:
         """Reply to unrecognized commands so the LLM never sees them."""
         await update.message.reply_text("Unknown command. Use /help to see available commands.")
+    async def _post_init(self, application: Application) -> None:
+        """
+        Schedule archival and MCP tool discovery as background tasks once the polling
+        event loop is live.
+        Registered as python-telegram-bot's post_init hook so a large archival backlog or slow/unreachable MCP
+        servers never block tele_handle_message/tele_handle_document from answering the first incoming update.
+        Uses application.create_task() rather than asyncio.create_task() so both tasks are tracked and
+        cancelled cleanly on shutdown.
+        """
+        if self._mcp_entries:
+            application.create_task(self._discover_mcp_tools_background())
+        application.create_task(run_archival(self.llm))
+    async def _discover_mcp_tools_background(self) -> None:
+        """
+        Discover MCP tools and merge them into self.webhook_schemas/self.webhook_defs.
+        Runs as a background task scheduled by _post_init() so MCP server round-trips
+        never block startup. Until this completes, owner-triggered tool calls simply see
+        the webhook-only tool set; MCP tools become available once discovery finishes.
+        """
+        try:
+            existing_names = set(self.webhook_defs.keys()) | {'search_messages'}
+            mcp_schemas, mcp_defs = await discover_mcp_tools(
+                self._mcp_entries, existing_names, allow_local=self._allow_local_webhooks,
+            )
+            self.webhook_schemas = self.webhook_schemas + mcp_schemas
+            self.webhook_defs = {**self.webhook_defs, **mcp_defs}
+            logger.info(f"MCP discovery complete: {len(mcp_schemas)} tool(s) registered")
+        except Exception:
+            logger.error("Background MCP discovery failed", exc_info=True)
     def poll(self):
         """
         Start the main polling loop for Telegram updates.
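
The background-discovery pattern above (answer with the webhook-only tool set first, merge MCP tools when discovery finishes) can be sketched with the standard library alone. This is an illustration under assumptions: `ToolRegistry` and its methods are hypothetical stand-ins, and plain `asyncio.create_task()` is used here where the diff uses python-telegram-bot's `application.create_task()` so that tasks are tracked and cancelled on shutdown.

```python
import asyncio


class ToolRegistry:
    """Hypothetical stand-in for the bot's tool state; not the package's class."""

    def __init__(self, webhook_tools: dict):
        self.tools = dict(webhook_tools)
        self.task = None

    async def _discover(self) -> dict:
        # Stand-in for a slow MCP server round-trip.
        await asyncio.sleep(0.05)
        return {"mcp_echo": "remote tool"}

    async def _discover_background(self) -> None:
        try:
            found = await self._discover()
            # Rebind instead of mutating in place, so readers never see a half-merged dict.
            self.tools = {**self.tools, **found}
        except Exception as exc:
            print(f"discovery failed: {exc!r}")

    def start(self) -> None:
        # Must be called while an event loop is running.
        self.task = asyncio.create_task(self._discover_background())


async def main() -> tuple:
    registry = ToolRegistry({"search_messages": "built-in"})
    registry.start()
    before = sorted(registry.tools)  # snapshot taken before discovery has run
    await registry.task
    after = sorted(registry.tools)
    return before, after


before, after = asyncio.run(main())
print(before)
print(after)
```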
@@ -1240,14 +1288,14 @@ class TelegramBot:
     # Initialization
     def __init__(self,
-        bot_owner      = INIT_BOT_CONFIG['bot_owner'],
-        bot_nickname   = INIT_BOT_CONFIG['bot_nickname'],
-        bot_initials   = INIT_BOT_CONFIG['bot_initials'],
-        chat_model     = INIT_BOT_CONFIG['chat_model'],
-        url_model      = INIT_BOT_CONFIG['url_model'],
-        token_limit    = INIT_BOT_CONFIG['token_limit'],
-        search_limit   = INIT_BOT_CONFIG['search_limit'],
-        persona_temp   = INIT_BOT_CONFIG['persona_temp'],
+        bot_owner           = INIT_BOT_CONFIG['bot_owner'],
+        bot_nickname        = INIT_BOT_CONFIG['bot_nickname'],
+        bot_initials        = INIT_BOT_CONFIG['bot_initials'],
+        chat_model          = INIT_BOT_CONFIG['chat_model'],
+        url_model           = INIT_BOT_CONFIG['url_model'],
+        token_limit         = INIT_BOT_CONFIG['token_limit'],
+        search_limit        = INIT_BOT_CONFIG['search_limit'],
+        persona_temp        = INIT_BOT_CONFIG['persona_temp'],
         archive_days        = INIT_BOT_CONFIG['archive_days'],
         document_processing = INIT_BOT_CONFIG['document_processing'],
         persona_prompt      = INIT_BOT_CONFIG['persona_prompt'],
@@ -1255,6 +1303,8 @@ class TelegramBot:
         instance_name: str | None = None,
         webhook_schemas: list | None = None,
         webhook_defs: dict | None = None,
+        mcp_entries: list | None = None,
+        allow_local_webhooks: bool = False,
     ):
         """
         Initialize the Telegram bot with LLM configuration and API keys.
@@ -1280,6 +1330,10 @@ class TelegramBot:
                              If None, no webhook tools are registered.
             webhook_defs: Resolved webhook tool definitions keyed by tool name (from build_tool_registry).
                           If None, no webhook tools are registered.
+            mcp_entries: Raw 'mcp_server:' config entries (from the 'tools:' block), or None.
+                         _post_init() discovers these in the background and merges results into
+                         self.webhook_schemas/self.webhook_defs once discovery completes.
+            allow_local_webhooks: Passed through to discover_mcp_tools() when MCP discovery runs.
         Side Effects:
             - Normalises bot_owner to list[str] and stores in self.telegram['owners'].
@@ -1296,7 +1350,9 @@ class TelegramBot:
         self.token_warning = {}  # Determines whether user has reached token limit by AI model
         self.conversations = {}  # Provides Conversation class per user based on bot response
         self.webhook_schemas = webhook_schemas or []  # Provider-compatible schemas for webhook tools
-        self.webhook_defs = webhook_defs or {}        # Resolved tool definitions keyed by name
+        self.webhook_defs = webhook_defs or {}  # Resolved tool definitions keyed by name
+        self._mcp_entries = mcp_entries or []  # Raw mcp_server entries; discovered in _post_init()
+        self._allow_local_webhooks = allow_local_webhooks
         owners = bot_owner if isinstance(bot_owner, list) else [bot_owner]
         self.telegram = {
             'bot_id'       : 0,  # overwritten by _tele_info(); 0 is a safe sentinel
@@ -1319,7 +1375,12 @@ class TelegramBot:
             loop.create_task(self._tele_info())
         # Build our application with handlers for Commands, Messages, and Errors
-        self.telegram['app'] = Application.builder().token(os.environ['TELLMGRAMBOT_TELEGRAM_API_KEY']).build()
+        self.telegram['app'] = (
+            Application.builder()
+            .token(os.environ['TELLMGRAMBOT_TELEGRAM_API_KEY'])
+            .post_init(self._post_init)
+            .build()
+        )
         self.telegram['app'].add_handler(CommandHandler('help', self.tele_commands))
         self.telegram['app'].add_handler(CommandHandler('start', self.tele_start_command))
         self.telegram['app'].add_handler(CommandHandler('stop', self.tele_stop_command))
@@ -1391,9 +1452,10 @@ class TelegramBot:
         Calls init_structure() to bootstrap directories, API keys, and configuration files,
         unpacking a three-tuple (ApiKeyStatus, config dict, persona prompt str with system
-        appendix already appended). Builds webhook tool registry from 'tools:' config, discovers
-        MCP tools from any 'mcp_server:' entries, and merges both into the final tool registry.
-        Applies defaults for any missing values and returns a fully initialized TelegramBot.
+        appendix already appended). Builds the webhook tool registry from 'tools:' config and
+        collects any 'mcp_server:' entries (passed through to TelegramBot for later discovery -
+        see Side Effects). Applies defaults for any missing values and returns a fully
+        initialized TelegramBot.
         Args:
             config_file: Filename of the bot configuration YAML (default: 'config.yaml').
@@ -1406,50 +1468,42 @@ class TelegramBot:
         Side Effects:
             - Calls init_structure() which creates directories, config/prompt files, and checks API keys.
-            - Calls discover_mcp_tools() if any 'mcp_server:' entries are in config (gracefully degrades if called from async context).
-            - May log warnings (for missing config values, empty prompt, or skipped MCP discovery), but does not print a startup API key status summary.
+            - Passes any 'mcp_server:' entries to TelegramBot as mcp_entries; discovery itself runs
+              in the background via TelegramBot._post_init() once the polling event loop is live.
+            - May log warnings (for missing config values or an empty prompt), but does not print a
+              startup API key status summary.
             - Log identity/file label is taken from bot config `instance_name` when set; otherwise defaults to the bot's Telegram username once _tele_info() resolves.
         """
         # Bootstrap directories, logging, config, prompt (with appendix), and API keys in one call.
         key_status, config, prompt = init_structure(config_file, prompt_file)
         # Build the webhook tool registry from the optional 'tools:' block in bot config.
-        allow_local = config['allow_local_webhooks'] or False
+        allow_local = _validated_allow_local(config['allow_local_webhooks'])
         webhook_schemas, webhook_defs = build_tool_registry(config.get('tools') or [], allow_local)
-        # Discover MCP tools from any 'mcp_server:' entries in the tools config.
+        # Raw mcp_server entries; TelegramBot._post_init() runs discovery in the background.
         mcp_entries = [
             e for e in (config.get('tools') or [])
             if isinstance(e, dict) and 'mcp_server' in e
         ]
-        if mcp_entries:
-            existing_names = set(webhook_defs.keys()) | {'search_messages'}
-            mcp_schemas, mcp_defs = [], {}
-            try:
-                asyncio.get_running_loop()
-                logger.warning("MCP discovery skipped: set() called from within an async context.")
-            except RuntimeError:
-                mcp_schemas, mcp_defs = asyncio.run(discover_mcp_tools(
-                    mcp_entries, existing_names, allow_local=allow_local,
-                ))
-            webhook_schemas = webhook_schemas + mcp_schemas
-            webhook_defs = {**webhook_defs, **mcp_defs}
         # Apply parameters to bot:
         return TelegramBot(
-            bot_owner       = config['bot_owner'],
-            bot_nickname    = config['bot_nickname'],
-            bot_initials    = config['bot_initials'],
-            chat_model      = config['chat_model'],
-            url_model       = config['url_model'],
-            token_limit     = config['token_limit'],
-            search_limit    = config['search_limit'],
-            persona_temp    = config['persona_temp'],
-            archive_days        = config['archive_days'],
-            document_processing = config.get('document_processing'),
-            persona_prompt      = prompt,
-            key_status      = key_status,
-            instance_name   = config['instance_name'],
-            webhook_schemas = webhook_schemas,
-            webhook_defs    = webhook_defs,
+            bot_owner            = config['bot_owner'],
+            bot_nickname         = config['bot_nickname'],
+            bot_initials         = config['bot_initials'],
+            chat_model           = config['chat_model'],
+            url_model            = config['url_model'],
+            token_limit          = config['token_limit'],
+            search_limit         = config['search_limit'],
+            persona_temp         = config['persona_temp'],
+            archive_days         = config['archive_days'],
+            document_processing  = config['document_processing'],
+            persona_prompt       = prompt,
+            key_status           = key_status,
+            instance_name        = config['instance_name'],
+            webhook_schemas      = webhook_schemas,
+            webhook_defs         = webhook_defs,
+            mcp_entries          = mcp_entries,
+            allow_local_webhooks = allow_local,
         )

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/archive.py RENAMED

@@ -11,6 +11,7 @@ Tier 2: Episodic summarization - old Tier 1 rows compressed into thematic digest
 Raw messages are never deleted; archived_at flags rows to skip during context loading.
 Search still hits raw rows regardless of archived_at.
 """
+import asyncio
 import json
 import logging
@@ -24,6 +25,10 @@ logger = logging.getLogger(__name__)
 _archival_running = False
+# Caps concurrent in-flight LLM calls per run_archival() invocation so a large
+# backlog doesn't hammer the provider's rate limits.
+_BATCH_CONCURRENCY = 5
 _TIER1_PROMPT = (
     "Extract key facts from this conversation. "
     "Ignore greetings, acknowledgments, and filler. "
@@ -88,14 +93,34 @@ def _fmt_ts(ts: str) -> str:
     return ts[:16].replace('T', ' ') + ' UTC'
+async def _gather_bounded(coros) -> None:
+    """
+    Run an iterable of coroutines with at most _BATCH_CONCURRENCY in flight at once.
+    Pulls coroutines from the iterable lazily via a fixed pool of _BATCH_CONCURRENCY
+    workers, rather than wrapping every coroutine in a Task up front - so a large
+    archival backlog never creates more pending tasks than the concurrency cap.
+    Args:
+        coros: Iterable of coroutines to execute with bounded concurrency.
+    """
+    coro_iter = iter(coros)
+    async def _worker():
+        for coro in coro_iter:
+            await coro
+    await asyncio.gather(*(_worker() for _ in range(_BATCH_CONCURRENCY)))
 async def _run_tier1(config: dict) -> None:
     """
     Extract key facts from messages older than archive_days into Tier 1 rows.
     Groups old messages by chat and day, batches each group through the LLM with
     _TIER1_PROMPT, stores the extracted facts as summary_archive rows, and flags
-    source rows with archived_at. Logs warnings on batch failures but continues
-    processing other batches.
+    source rows with archived_at. Batches run concurrently, bounded by _BATCH_CONCURRENCY.
+    Logs warnings on batch failures but continues processing other batches.
     Args:
         config: Dict with keys: chat_model, and optionally archive_days.
@@ -131,12 +156,15 @@ async def _run_tier1(config: dict) -> None:
         batches.setdefault(key, []).append(row)
     provider = get_provider(model)
-    for (chat_id, day), batch in batches.items():
+    async def _process(chat_id, day, batch):
         try:
             await _process_tier1_batch(provider, model, chat_id, day, batch)
         except Exception as e:
             logger.warning(f"ARCHIVE: Tier 1 batch {chat_id}/{day} failed: {e}")
+    await _gather_bounded(_process(chat_id, day, batch) for (chat_id, day), batch in batches.items())
 async def _process_tier1_batch(provider, model: str, chat_id: int, day: str, rows: list) -> None:
     """
@@ -209,10 +237,10 @@ async def _run_tier2(config: dict) -> None:
     """
     Compress old Tier 1 rows into Tier 2 (episodic) summaries.
-    Groups Tier 1 rows older than archive_days * 2 by chat and month, batches each
-    group through the LLM with _TIER2_PROMPT, stores the result as a summary_archive
-    row (tier 2), and flags source Tier 1 rows with archived_at. Logs warnings on
-    batch failures but continues processing other batches.
+    Groups Tier 1 rows older than archive_days * 2 by chat and month, batches each group through the LLM
+    with _TIER2_PROMPT, stores the result as a summary_archive row (tier 2), and flags source Tier 1 rows
+    with archived_at. Batches run concurrently, bounded by _BATCH_CONCURRENCY.
+    Logs warnings on batch failures but continues processing other batches.
     Args:
         config: Dict with keys: chat_model, and optionally archive_days.
@@ -246,12 +274,15 @@ async def _run_tier2(config: dict) -> None:
         batches.setdefault(key, []).append(row)
     provider = get_provider(model)
-    for (chat_id, month), batch in batches.items():
+    async def _process(chat_id, month, batch):
         try:
             await _process_tier2_batch(provider, model, chat_id, month, batch)
         except Exception as e:
             logger.warning(f"ARCHIVE: Tier 2 batch {chat_id}/{month} failed: {e}")
+    await _gather_bounded(_process(chat_id, month, batch) for (chat_id, month), batch in batches.items())
 async def _process_tier2_batch(provider, model: str, chat_id: int, month: str, rows: list) -> None:
     """
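
The worker-pool approach in `_gather_bounded()` can be tried in isolation. The sketch below reproduces the shared-iterator technique with a demo cap of 3 and records the peak number of jobs in flight; `gather_bounded`, `demo`, and `job` are illustrative names, and the sleep stands in for an LLM call.

```python
import asyncio

CONCURRENCY = 3  # demo value; the diff sets _BATCH_CONCURRENCY = 5


async def gather_bounded(coros, limit: int = CONCURRENCY) -> None:
    """Run coroutines with at most `limit` in flight, pulling lazily from one shared iterator."""
    coro_iter = iter(coros)

    async def worker():
        # Each worker takes the next unstarted coroutine; the loop ends when the iterator is exhausted.
        for coro in coro_iter:
            await coro

    await asyncio.gather(*(worker() for _ in range(limit)))


async def demo(n_jobs: int = 10, limit: int = CONCURRENCY) -> tuple:
    in_flight = 0
    peak = 0
    done = []

    async def job(i: int) -> None:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for a provider call
        in_flight -= 1
        done.append(i)

    # A generator expression keeps coroutine creation lazy, as in the diff.
    await gather_bounded((job(i) for i in range(n_jobs)), limit)
    return peak, sorted(done)


peak, done = asyncio.run(demo())
print(peak, done)
```

Because the generator is consumed lazily, no more than `limit` coroutine objects are started at any one time, whereas wrapping every coroutine in a Task up front would create one pending task per batch.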

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/initialize.py RENAMED

@@ -13,7 +13,6 @@ from typing import Optional
 from .utils import read_text, generate_file_path, read_yaml, execution_dir, generate_filename
 from .database import set_db_filename, init_db, get_db_path
-from .archive import run_archival
 logger = logging.getLogger(__name__)
 _logging_initialized = False
@@ -529,6 +528,9 @@ def init_structure(
     the custom filename (only if the target does not yet exist). This ordering ensures the DB
     filename is settled before schema init.
+    Archival (Tier 1 and Tier 2) runs as a background task scheduled by TelegramBot._post_init()
+    once the polling event loop is live, so large backlogs never block the bot's first message response.
     Args:
         config_file: Name of the bot configuration file (default: 'config.yaml').
                      Resolved relative to TELLMGRAMBOT_CONFIGS_PATH.
@@ -576,9 +578,9 @@ def init_structure(
         else:
             logger.debug(f"DB migration skipped: conversations.db not found at {legacy_path}")
-    # In the sync path (no running loop), run DB init now before keys are loaded.
-    # In the async path (running loop), DB init is deferred into the background task
-    # so it runs after keys are loaded (below).
+    # In the sync path (no running loop), run DB init now.
+    # In the async path (running loop - e.g. a caller awaiting init_structure() from
+    # inside an already-async context), DB init is deferred into a background task.
     try:
         loop = asyncio.get_running_loop()
         _has_loop = True
@@ -593,18 +595,14 @@ def init_structure(
         logger.warning(f"File '{prompt_file}' is empty, using default persona prompt.")
     prompt = prompt.rstrip() + "\n\n" + _SYSTEM_APPENDIX
-    # Load API keys before running archival so the provider can authenticate.
     key_status = init_keys(config)
     if _has_loop:
-        async def _init_and_archive():
+        async def _init_db_task():
             try:
                 await init_db()
-                await run_archival(config)
             except Exception:
-                logger.error(f"Background startup initialization/archive task failed", exc_info=True)
-        loop.create_task(_init_and_archive())
-    else:
-        asyncio.run(run_archival(config))
+                logger.error("Background startup DB initialization failed", exc_info=True)
+        loop.create_task(_init_db_task())
     return (key_status, config, prompt)

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/message_handlers.py RENAMED

@@ -2,11 +2,14 @@
 import io
 import logging
 import re
+import zipfile
 from pathlib import Path
 from typing import Optional
 from charset_normalizer import from_bytes as _cn_from_bytes
 import defusedxml.ElementTree as _defusedxml_ET
+import docx
+import openpyxl
 import pypdf
 from .utils import log_error
@@ -56,7 +59,13 @@ _PLAIN_TEXT_EXTENSIONS = frozenset({
 _HTML_MIMES = frozenset({'text/html', 'application/xhtml+xml'})
 _XML_MIMES  = frozenset({'text/xml', 'application/xml'})
+_DOCX_MIME  = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
+_XLSX_MIME  = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
 _PDF_PAGE_CAP = 100
+_DOCX_LINE_CAP = 2000
+_XLSX_ROW_CAP = 2000
+_ZIP_ENTRY_SIZE_CAP = 100_000_000  # 100 MB uncompressed per entry - zip-bomb guard
+_ZIP_TOTAL_SIZE_CAP = 500_000_000  # 500 MB uncompressed across all entries combined
 def handle_greetings(text: str) -> Optional[str]:
@@ -87,6 +96,39 @@ def handle_common_queries(text: str) -> Optional[str]:
     return None
+def _zip_entries_within_cap(file_bytes: bytes) -> bool:
+    """
+    Check declared uncompressed entry sizes in a zip-based file before decompression.
+    Guards against zip-bomb style resource exhaustion for .docx/.xlsx (both are zip
+    containers), without actually decompressing anything. Rejects a file if any single
+    entry's declared uncompressed size exceeds _ZIP_ENTRY_SIZE_CAP, or if the sum across
+    all entries exceeds _ZIP_TOTAL_SIZE_CAP (catches many small entries that each pass
+    the per-entry cap but together still exhaust memory). Returns False for a corrupted/
+    non-zip file too, since python-docx/openpyxl would fail on it anyway.
+    Args:
+        file_bytes: Raw file bytes to inspect.
+    Returns:
+        True if the file is a valid zip, every entry's declared size is within the
+        per-entry cap, and the total declared size is within the total cap; False
+        otherwise.
+    """
+    try:
+        with zipfile.ZipFile(io.BytesIO(file_bytes)) as zf:
+            total = 0
+            for info in zf.infolist():
+                if info.file_size > _ZIP_ENTRY_SIZE_CAP:
+                    return False
+                total += info.file_size
+                if total > _ZIP_TOTAL_SIZE_CAP:
+                    return False
+            return True
+    except Exception:
+        return False
 def _decode_bytes(raw: bytes) -> tuple:
     """
     Decode raw bytes to a string via UTF-8 -> charset-normalizer -> Latin-1 chain.
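
The declared-size check in `_zip_entries_within_cap()` can be exercised with in-memory archives. The sketch below assumes much smaller caps for the demo and catches only `zipfile.BadZipFile`, whereas the diff catches any exception; `zip_within_caps` and `make_zip` are illustrative names.

```python
import io
import zipfile

ENTRY_CAP = 1_000  # demo values, far smaller than the caps in the diff
TOTAL_CAP = 2_500


def zip_within_caps(data: bytes, entry_cap: int = ENTRY_CAP, total_cap: int = TOTAL_CAP) -> bool:
    """Check declared uncompressed sizes without decompressing any entry."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            total = 0
            for info in zf.infolist():
                if info.file_size > entry_cap:
                    return False
                total += info.file_size
                if total > total_cap:
                    return False
            return True
    except zipfile.BadZipFile:
        return False


def make_zip(entries: dict) -> bytes:
    """Build a small deflated zip in memory from a name -> bytes mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, payload in entries.items():
            zf.writestr(name, payload)
    return buf.getvalue()


small = make_zip({"a.xml": b"x" * 500})
one_big = make_zip({"a.xml": b"x" * 1_500})
many = make_zip({f"{i}.xml": b"x" * 900 for i in range(3)})

print(zip_within_caps(small), zip_within_caps(one_big), zip_within_caps(many))
```

The sizes inspected here come from the archive's own metadata, so this is a cheap pre-check and does not verify what decompression would actually produce.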
@@ -108,12 +150,16 @@ def _extract_document_text(file_bytes: bytes, mime_type: str, filename: str) ->
     """
     Extract plain text from document bytes, routing by MIME type and file extension.
-    PDF text is extracted via pypdf (capped at _PDF_PAGE_CAP pages, strict=False).
-    HTML content has tags stripped via strip_html_markup. XML is safely parsed via
-    defusedxml to extract text nodes without XXE risk; falls back to plain-text
-    decode if the XML is malformed. All other plain-text types are decoded using
-    a UTF-8 -> charset-normalizer -> Latin-1 chain; non-UTF-8 files prepend a
-    [File encoding: ...] annotation so the LLM has context.
+    Extraction by type:
+      PDF   - pypdf, capped at _PDF_PAGE_CAP pages, strict=False.
+      .docx - python-docx; paragraphs + flattened table rows, capped at _DOCX_LINE_CAP combined lines.
+      .xlsx - openpyxl in read_only/data_only mode; capped at _XLSX_ROW_CAP rows total across all sheets.
+              .docx/.xlsx are zip containers, so _zip_entries_within_cap() rejects implausible declared
+              uncompressed entry sizes before either library decompresses anything (zip-bomb guard).
+      HTML  - tags stripped via strip_html_markup.
+      XML   - defusedxml (XXE-safe); falls back to plain-text decode if malformed.
+      other - UTF-8 -> charset-normalizer -> Latin-1 chain
+              non-UTF-8 files get a [File encoding: ...] prefix.
     Args:
         file_bytes: Raw document bytes downloaded from Telegram.
@@ -128,6 +174,8 @@ def _extract_document_text(file_bytes: bytes, mime_type: str, filename: str) ->
     ext  = Path(filename).suffix.lower() if filename else ''
     is_pdf      = mime == 'application/pdf' or ext == '.pdf'
+    is_docx     = ext == '.docx' or mime == _DOCX_MIME
+    is_xlsx     = ext == '.xlsx' or mime == _XLSX_MIME
     is_html     = ext in ('.html', '.htm') or mime in _HTML_MIMES
     is_xml      = ext == '.xml' or mime in _XML_MIMES
     is_plain    = mime.startswith('text/') or ext in _PLAIN_TEXT_EXTENSIONS
@@ -145,6 +193,67 @@ def _extract_document_text(file_bytes: bytes, mime_type: str, filename: str) ->
             log_error(e, 'PDF')
             return None, "Something went wrong while reading that PDF. Please try again."
+    if is_docx:
+        if not _zip_entries_within_cap(file_bytes):
+            return None, "Something went wrong while reading that document. Please try again."
+        try:
+            document = docx.Document(io.BytesIO(file_bytes))
+            lines = []
+            for p in document.paragraphs:
+                if len(lines) >= _DOCX_LINE_CAP:
+                    break
+                if p.text:
+                    lines.append(p.text)
+            for table in document.tables:
+                if len(lines) >= _DOCX_LINE_CAP:
+                    break
+                for row in table.rows:
+                    if len(lines) >= _DOCX_LINE_CAP:
+                        break
+                    row_text = '\t'.join(cell.text for cell in row.cells)
+                    if row_text.strip():
+                        lines.append(row_text)
+            text = '\n'.join(lines)
+            if not text.strip():
+                return None, "This document appears to have no readable text in it."
+            return text, None
+        except Exception as e:
+            log_error(e, 'DOCX')
+            return None, "Something went wrong while reading that document. Please try again."
+    if is_xlsx:
+        if not _zip_entries_within_cap(file_bytes):
+            return None, "Something went wrong while reading that spreadsheet. Please try again."
+        workbook = None
+        try:
+            workbook = openpyxl.load_workbook(io.BytesIO(file_bytes), read_only=True, data_only=True)
+            lines = []
+            total_rows = 0
+            for sheet in workbook.worksheets:
+                if total_rows >= _XLSX_ROW_CAP:
+                    break
+                sheet_lines = []
+                for row in sheet.iter_rows(values_only=True):
+                    if total_rows >= _XLSX_ROW_CAP:
+                        break
+                    values = ['' if v is None else str(v) for v in row]
+                    if any(values):
+                        sheet_lines.append('\t'.join(values))
+                        total_rows += 1
+                if sheet_lines:
+                    lines.append(f"## Sheet: {sheet.title}")
+                    lines.extend(sheet_lines)
+            text = '\n'.join(lines)
+            if not text.strip():
+                return None, "This spreadsheet appears to have no readable data in it."
+            return text, None
+        except Exception as e:
+            log_error(e, 'XLSX')
+            return None, "Something went wrong while reading that spreadsheet. Please try again."
+        finally:
+            if workbook is not None:
+                workbook.close()
     if is_html:
         raw_text, _ = _decode_bytes(file_bytes)
         return strip_html_markup(raw_text), None
@@ -166,7 +275,7 @@ def _extract_document_text(file_bytes: bytes, mime_type: str, filename: str) ->
             text = f"[File encoding: {encoding}]\n{text}"
         return text, None
-    return None, "I can only read plain text and PDF files right now."
+    return None, "I can only read PDF, Word documents (.docx), spreadsheets (.xlsx), HTML, XML, and plain text files right now."
 async def summarise_text(
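
Note on the hunk above: the XLSX branch calls `_zip_entries_within_cap(file_bytes)` before handing the bytes to openpyxl, but the helper's body is not part of this hunk. The sketch below shows one way such a guard against oversized archives could be written with the standard library. The function name without the leading underscore, both limits, and the entry-count check are assumptions made for illustration, not the package's actual implementation.

```python
import io
import zipfile

# Hypothetical limits for illustration; the package's real caps are not shown in this diff.
MAX_TOTAL_UNCOMPRESSED = 50 * 1024 * 1024  # 50 MB summed over all entries
MAX_ENTRIES = 1000


def zip_entries_within_cap(file_bytes: bytes,
                           max_total: int = MAX_TOTAL_UNCOMPRESSED,
                           max_entries: int = MAX_ENTRIES) -> bool:
    """Return True if the archive's declared uncompressed size stays within the caps."""
    try:
        with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
            infos = archive.infolist()
            if len(infos) > max_entries:
                return False
            # file_size is the uncompressed size recorded in the ZIP directory.
            total = sum(info.file_size for info in infos)
            return total <= max_total
    except zipfile.BadZipFile:
        # Not a ZIP container at all, so it cannot be a valid .docx/.xlsx.
        return False
```

This checks only the sizes declared in the archive's directory, so it is a cheap pre-filter rather than a complete defence. The row cap in the XLSX loop above (`_XLSX_ROW_CAP`) still bounds how much is actually read.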
@@ -179,12 +288,10 @@ async def summarise_text(
     """
     Token-prune content, apply template, and complete via the LLM.
-    Prunes content so the fully composed system message (prompt + template with content
-    substituted) fits within the model's token budget (max_tokens - 500), then calls the LLM.
-    Token counting is measured against the composed message at every pruning step, not just
-    the raw content, so the budget guarantee matches what is actually sent to the provider -
-    a large template or persona prompt is accounted for, not just the content itself. The
-    template must contain a {content} placeholder.
+    Prunes content so the fully composed message (prompt + template + content) fits the
+    model's token budget (max_tokens - 500), measuring the composed message at every
+    pruning step - not just raw content - so the budget guarantee matches what's actually
+    sent. The template must contain a {content} placeholder.
     Args:
         content: Text content to summarise (URL body or document text).
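
Note on the docstring above: it describes pruning against the fully composed message rather than the raw content. A minimal sketch of that idea follows, assuming a stand-in token counter (whitespace-separated words) and a binary search over how many words of content to keep. `count_tokens`, `prune_to_budget`, and the search strategy are hypothetical. The real function's tokenizer and pruning steps are not shown in this hunk.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts whitespace-separated words (an assumption)."""
    return len(text.split())


def prune_to_budget(prompt: str, template: str, content: str,
                    max_tokens: int, reserve: int = 500) -> str:
    """Trim content until prompt + filled template fits within max_tokens - reserve."""
    if "{content}" not in template:
        raise ValueError("template must contain a {content} placeholder")
    budget = max_tokens - reserve

    def composed_size(candidate: str) -> int:
        # Measure the whole message that would be sent, not just the content.
        return count_tokens(prompt + "\n" + template.replace("{content}", candidate))

    if composed_size(content) <= budget:
        return content

    words = content.split()
    lo, hi = 0, len(words)  # search for the largest word count that still fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if composed_size(" ".join(words[:mid])) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return " ".join(words[:lo])
```

Because the size check includes the prompt and template, a long persona prompt shrinks the room left for content. If the prompt and template alone exceed the budget, this sketch returns an empty string.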

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: TeLLMgramBot
-Version: 3.15.0
+Version: 3.15.2
 Summary: LLM-powered Telegram bot (OpenAI + Anthropic)
 Home-page: https://github.com/Digital-Heresy/TeLLMgramBot
 Author: Digital Heresy
@@ -22,6 +22,8 @@ Requires-Dist: tzdata>=2025.2
 Requires-Dist: pypdf>=6.0
 Requires-Dist: defusedxml>=0.7
 Requires-Dist: charset-normalizer>=3.0
+Requires-Dist: python-docx>=1.2
+Requires-Dist: openpyxl>=3.1
 Dynamic: author
 Dynamic: author-email
 Dynamic: description
@@ -45,7 +47,7 @@ The basic goal of this project is to create a bridge between a Telegram Bot and
   * Example: "What do you think of this article? [https://some_site/article]"
   * Uses a separate model (configurable via `url_model`) to handle larger URL content.
 * Share documents and text files for analysis and summarisation.
-  * Supported formats: PDF, plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML.
+  * Supported formats: PDF (via pypdf), Microsoft Office documents (.docx via python-docx, .xlsx via openpyxl), plain-text files (.txt, .md, .rst, .csv, .json, etc.), HTML, and XML (via defusedxml).
   * The bot extracts and summarises content, with automatic encoding detection for non-UTF-8 files. Files over 20 MB are rejected.
   * Can be disabled via `document_processing: false` in config.
 * Ask questions about message history across all your chats using natural language; the bot will search, attribute messages to speakers, and include messages from other bots.

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot.egg-info/requires.txt RENAMED

@@ -11,3 +11,5 @@ tzdata>=2025.2
 pypdf>=6.0
 defusedxml>=0.7
 charset-normalizer>=3.0
+python-docx>=1.2
+openpyxl>=3.1

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/setup.py RENAMED

@@ -5,7 +5,7 @@ with open("README.md", "r") as fh:
 setup(
     name='TeLLMgramBot',
-    version='3.15.0',
+    version='3.15.2',
     packages=find_packages(),
     license='MIT',
     author='Digital Heresy',
@@ -28,6 +28,8 @@ setup(
         'pypdf>=6.0',
         'defusedxml>=0.7',
         'charset-normalizer>=3.0',
+        'python-docx>=1.2',
+        'openpyxl>=3.1',
     ],
     python_requires='>=3.10'
 )

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/LICENSE RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/__init__.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/conversation.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/database.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/models.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/providers/__init__.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/providers/anthropic_provider.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/providers/base.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/providers/factory.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/providers/openai_provider.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/tools.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/utils.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot/web_utils.py RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot.egg-info/SOURCES.txt RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot.egg-info/dependency_links.txt RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/TeLLMgramBot.egg-info/top_level.txt RENAMED

File without changes

{tellmgrambot-3.15.0 → tellmgrambot-3.15.2}/setup.cfg RENAMED

File without changes
