PyPI - ursa-ai - Versions diffs - 0.0.3__tar.gz → 0.2.2__tar.gz - Mend

ursa-ai 0.0.3tar.gz → 0.2.2tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of ursa-ai might be problematic.

Files changed (38)

ursa_ai-0.2.2/LICENSE +8 -0
ursa_ai-0.2.2/PKG-INFO +130 -0
ursa_ai-0.2.2/README.md +88 -0
ursa_ai-0.2.2/pyproject.toml +77 -0
ursa_ai-0.2.2/src/ursa/agents/__init__.py +10 -0
ursa_ai-0.2.2/src/ursa/agents/arxiv_agent.py +349 -0
ursa_ai-0.2.2/src/ursa/agents/base.py +42 -0
ursa_ai-0.2.2/src/ursa/agents/code_review_agent.py +332 -0
ursa_ai-0.2.2/src/ursa/agents/execution_agent.py +497 -0
ursa_ai-0.2.2/src/ursa/agents/hypothesizer_agent.py +597 -0
ursa_ai-0.2.2/src/ursa/agents/mp_agent.py +257 -0
ursa_ai-0.2.2/src/ursa/agents/planning_agent.py +138 -0
ursa_ai-0.2.2/src/ursa/agents/recall_agent.py +25 -0
ursa_ai-0.2.2/src/ursa/agents/websearch_agent.py +193 -0
ursa_ai-0.2.2/src/ursa/prompt_library/code_review_prompts.py +51 -0
ursa_ai-0.2.2/src/ursa/prompt_library/execution_prompts.py +36 -0
ursa_ai-0.2.2/src/ursa/prompt_library/hypothesizer_prompts.py +17 -0
ursa_ai-0.2.2/src/ursa/prompt_library/literature_prompts.py +11 -0
ursa_ai-0.2.2/src/ursa/prompt_library/planning_prompts.py +79 -0
ursa_ai-0.2.2/src/ursa/prompt_library/websearch_prompts.py +131 -0
ursa_ai-0.2.2/src/ursa/tools/run_command.py +27 -0
ursa_ai-0.2.2/src/ursa/tools/write_code.py +42 -0
ursa_ai-0.2.2/src/ursa/util/diff_renderer.py +121 -0
ursa_ai-0.2.2/src/ursa/util/memory_logger.py +171 -0
ursa_ai-0.2.2/src/ursa/util/parse.py +89 -0
ursa_ai-0.2.2/src/ursa_ai.egg-info/PKG-INFO +130 -0
ursa_ai-0.2.2/src/ursa_ai.egg-info/SOURCES.txt +29 -0
ursa_ai-0.2.2/src/ursa_ai.egg-info/requires.txt +21 -0
ursa_ai-0.0.3/PKG-INFO +0 -7
ursa_ai-0.0.3/README.md +0 -0
ursa_ai-0.0.3/pyproject.toml +0 -33
ursa_ai-0.0.3/src/ursa/__init__.py +0 -2
ursa_ai-0.0.3/src/ursa/py.typed +0 -0
ursa_ai-0.0.3/src/ursa_ai.egg-info/PKG-INFO +0 -7
ursa_ai-0.0.3/src/ursa_ai.egg-info/SOURCES.txt +0 -8
{ursa_ai-0.0.3 → ursa_ai-0.2.2}/setup.cfg +0 -0
{ursa_ai-0.0.3 → ursa_ai-0.2.2}/src/ursa_ai.egg-info/dependency_links.txt +0 -0
{ursa_ai-0.0.3 → ursa_ai-0.2.2}/src/ursa_ai.egg-info/top_level.txt +0 -0

ursa_ai-0.2.2/LICENSE ADDED

@@ -0,0 +1,8 @@
+This program is Open-Source under the BSD-3 License.
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+    Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+    Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+    Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

ursa_ai-0.2.2/PKG-INFO ADDED

@@ -0,0 +1,130 @@
+Metadata-Version: 2.4
+Name: ursa-ai
+Version: 0.2.2
+Summary: Agents for science at LANL
+Author-email: Mike Grosskopf <mikegros@lanl.gov>, Rahul Somasundaram <rsomasundaram@lanl.gov>, Arthur Lui <alui@lanl.gov>
+Project-URL: Homepage, https://github.com/lanl/ursa
+Project-URL: Documentation, https://github.com/lanl/ursa/tree/main/docs
+Project-URL: Repository, https://github.com/lanl/ursa
+Project-URL: Issues, https://github.com/lanl/ursa/issues
+Classifier: Operating System :: OS Independent
+Classifier: License :: OSI Approved :: BSD License
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: 3.13
+Classifier: Programming Language :: Python :: 3.14
+Requires-Python: >=3.10
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: arxiv<3.0,>=2.2.0
+Requires-Dist: beautifulsoup4<5.0,>=4.13.4
+Requires-Dist: coolname<3.0,>=2.2.0
+Requires-Dist: langchain<0.4,>=0.3.22
+Requires-Dist: langchain-community<0.4,>=0.3.20
+Requires-Dist: langchain-litellm>=0.2.2
+Requires-Dist: langchain-openai<0.4,>=0.3.12
+Requires-Dist: langgraph>=0.5
+Requires-Dist: pandas<3.0,>=2.2.3
+Requires-Dist: pillow>=11.2.1
+Requires-Dist: pymupdf<2.0,>=1.26.0
+Requires-Dist: pypdf<6.0,>=5.4.0
+Requires-Dist: rich<14.0,>=13.7.0
+Requires-Dist: langchain-chroma>=0.2.4
+Requires-Dist: chromadb>=1.0.15
+Requires-Dist: mp-api>=0.45.8
+Requires-Dist: langchain-google-genai>=2.1.9
+Requires-Dist: langchain-anthropic>=0.3.18
+Requires-Dist: langgraph-checkpoint-sqlite>=2.0.10
+Requires-Dist: duckduckgo-search>=8.1.1
+Requires-Dist: langchain-ollama>=0.3.6
+Dynamic: license-file
+# URSA - The Universal Research and Scientific Agent
+<img src="./logos/logo.png" alt="URSA Logo" width="200" height="200">
+[![PyPI Version][pypi-version]](https://pypi.org/project/ursa-ai/)
+[![PyPI Downloads][total-downloads]](https://pepy.tech/projects/ursa-ai)
+The flexible agentic workflow for accelerating scientific tasks.
+Composes information flow between agents for planning, code writing and execution, and online research to solve complex problems.
+## Installation
+You can install `ursa` via `pip` or `uv`.
+**pip**
+```bash
+pip install ursa-ai
+```
+**uv**
+```bash
+uv add ursa-ai
+```
+## How to use this code
+Better documentation will be incoming, but for now there are examples in the examples folder that should give
+a decent idea for how to set up some basic problems. They also should give some idea of how to pass results from
+one agent to another. I will look to add things with multi-agent graphs, etc. in the future.
+Documentation for each URSA agent:
+- [Planning Agent](docs/planning_agent.md)
+- [Execution Agent](docs/execution_agent.md)
+- [ArXiv Agent](docs/arxiv_agent.md)
+- [Web Search Agent](docs/web_search_agent.md)
+- [Hypothesizer Agent](docs/hypothesizer_agent.md)
+Documentation for combining agents:
+- [ArXiv -> Execution for Materials](docs/combining_arxiv_and_execution.md)
+- [ArXiv -> Execution for Neutron Star Properties](docs/combining_arxiv_and_execution_neutronStar.md)
+# Sandboxing
+The Execution Agent is allowed to run system commands and write/run code. Being able to execute arbitrary system commands or write
+and execute code has the potential to cause problems like:
+- Damage code or data on the computer
+- Damage the computer
+- Transmit your local data
+The Web Search Agent scrapes data from urls, so has the potential to attempt to pull information from questionable sources.
+Some suggestions for sandboxing the agent:
+- Creating a specific environment such that limits URSA's access to only what you want. Examples:
+    - Creating/using a virtual machine that is sandboxed from the rest of your machine
+    - Creating a new account on your machine specifically for URSA
+- Creating a network blacklist/whitelist to ensure that network commands and webscraping are contained to safe sources
+You have a duty for ensuring that you use URSA responsibly.
+## Development Dependencies
+* [`uv`](https://docs.astral.sh/uv/)
+    * `uv` is an extremely fast python package and project manager, written in Rust.
+      Follow installation instructions
+      [here](https://docs.astral.sh/uv/getting-started/installation/)
+* [`ruff`](https://docs.astral.sh/ruff/)
+    * An extremely fast Python linter and code formatter, written in Rust.
+    * After installing `uv`, you can install just ruff `uv tool install ruff`
+* [`just`](https://github.com/casey/just)
+    * A modern way to save and run project-specific commands
+    * After installing `uv`, you can install just with `uv tool install rust-just`
+## Development Team
+URSA has been developed at Los Alamos National Laboratory as part of the ArtIMis project.
+<img src="./logos/artimis.png" alt="ArtIMis Logo" width="200" height="200">
+### Notice of Copyright Assertion (O4958):
+*This program is Open-Source under the BSD-3 License.
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:*
+- *Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.*
+- *Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.*
+- *Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.*
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+[pypi-version]: https://img.shields.io/pypi/v/ursa-ai?style=flat-square&label=PyPI
+[total-downloads]: https://img.shields.io/pepy/dt/ursa-ai?style=flat-square&label=downloads&color=blue

ursa_ai-0.2.2/README.md ADDED

@@ -0,0 +1,88 @@
+# URSA - The Universal Research and Scientific Agent
+<img src="./logos/logo.png" alt="URSA Logo" width="200" height="200">
+[![PyPI Version][pypi-version]](https://pypi.org/project/ursa-ai/)
+[![PyPI Downloads][total-downloads]](https://pepy.tech/projects/ursa-ai)
+The flexible agentic workflow for accelerating scientific tasks.
+Composes information flow between agents for planning, code writing and execution, and online research to solve complex problems.
+## Installation
+You can install `ursa` via `pip` or `uv`.
+**pip**
+```bash
+pip install ursa-ai
+```
+**uv**
+```bash
+uv add ursa-ai
+```
+## How to use this code
+Better documentation will be incoming, but for now there are examples in the examples folder that should give
+a decent idea for how to set up some basic problems. They also should give some idea of how to pass results from
+one agent to another. I will look to add things with multi-agent graphs, etc. in the future.
+Documentation for each URSA agent:
+- [Planning Agent](docs/planning_agent.md)
+- [Execution Agent](docs/execution_agent.md)
+- [ArXiv Agent](docs/arxiv_agent.md)
+- [Web Search Agent](docs/web_search_agent.md)
+- [Hypothesizer Agent](docs/hypothesizer_agent.md)
+Documentation for combining agents:
+- [ArXiv -> Execution for Materials](docs/combining_arxiv_and_execution.md)
+- [ArXiv -> Execution for Neutron Star Properties](docs/combining_arxiv_and_execution_neutronStar.md)
+# Sandboxing
+The Execution Agent is allowed to run system commands and write/run code. Being able to execute arbitrary system commands or write
+and execute code has the potential to cause problems like:
+- Damage code or data on the computer
+- Damage the computer
+- Transmit your local data
+The Web Search Agent scrapes data from urls, so has the potential to attempt to pull information from questionable sources.
+Some suggestions for sandboxing the agent:
+- Creating a specific environment such that limits URSA's access to only what you want. Examples:
+    - Creating/using a virtual machine that is sandboxed from the rest of your machine
+    - Creating a new account on your machine specifically for URSA
+- Creating a network blacklist/whitelist to ensure that network commands and webscraping are contained to safe sources
+You have a duty for ensuring that you use URSA responsibly.
+## Development Dependencies
+* [`uv`](https://docs.astral.sh/uv/)
+    * `uv` is an extremely fast python package and project manager, written in Rust.
+      Follow installation instructions
+      [here](https://docs.astral.sh/uv/getting-started/installation/)
+* [`ruff`](https://docs.astral.sh/ruff/)
+    * An extremely fast Python linter and code formatter, written in Rust.
+    * After installing `uv`, you can install just ruff `uv tool install ruff`
+* [`just`](https://github.com/casey/just)
+    * A modern way to save and run project-specific commands
+    * After installing `uv`, you can install just with `uv tool install rust-just`
+## Development Team
+URSA has been developed at Los Alamos National Laboratory as part of the ArtIMis project.
+<img src="./logos/artimis.png" alt="ArtIMis Logo" width="200" height="200">
+### Notice of Copyright Assertion (O4958):
+*This program is Open-Source under the BSD-3 License.
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:*
+- *Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.*
+- *Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.*
+- *Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.*
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+[pypi-version]: https://img.shields.io/pypi/v/ursa-ai?style=flat-square&label=PyPI
+[total-downloads]: https://img.shields.io/pepy/dt/ursa-ai?style=flat-square&label=downloads&color=blue
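
Reviewer note on the README's sandboxing suggestions: the network whitelist idea can be approximated in application code before any URL is fetched. The block below is a minimal sketch of that idea, not something shipped in ursa-ai; `ALLOWED_HOSTS` and `is_allowed_url` are hypothetical names, and the hosts listed are only examples.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; choose hosts appropriate for your own deployment.
ALLOWED_HOSTS = frozenset({"arxiv.org", "export.arxiv.org"})


def is_allowed_url(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Return True only for http(s) URLs whose exact host is on the allowlist."""
    parsed = urlparse(url)
    # Reject non-web schemes such as file:// or ftp://.
    if parsed.scheme not in ("http", "https"):
        return False
    # hostname strips credentials and port, and is lowercased by urlparse.
    host = (parsed.hostname or "").lower()
    # Exact match only, so "arxiv.org.evil.example" is not accepted.
    return host in allowed_hosts
```

A check like this only filters requests made by code that calls it. It does not constrain commands an agent runs in a shell, so it complements rather than replaces the virtual machine or dedicated account the README recommends.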

ursa_ai-0.2.2/pyproject.toml ADDED

@@ -0,0 +1,77 @@
+[project]
+name = "ursa-ai"
+dynamic = ["version"]
+description = "Agents for science at LANL"
+readme = "README.md"
+authors = [
+    { name = "Mike Grosskopf", email = "mikegros@lanl.gov" },
+    { name = "Rahul Somasundaram", email = "rsomasundaram@lanl.gov" },
+    { name = "Arthur Lui", email = "alui@lanl.gov" }
+]
+requires-python = ">=3.10"
+dependencies = [
+    "arxiv>=2.2.0,<3.0",
+    "beautifulsoup4>=4.13.4,<5.0",
+    "coolname>=2.2.0,<3.0",
+    "langchain>=0.3.22,<0.4",
+    "langchain-community>=0.3.20,<0.4",
+    "langchain-litellm>=0.2.2",
+    "langchain-openai>=0.3.12,<0.4",
+    "langgraph>=0.5",
+    "pandas>=2.2.3,<3.0",
+    "pillow>=11.2.1",
+    "pymupdf>=1.26.0,<2.0",
+    "pypdf>=5.4.0,<6.0",
+    "rich>=13.7.0,<14.0",
+    "langchain-chroma>=0.2.4",
+    "chromadb>=1.0.15",
+    "mp-api>=0.45.8",
+    "langchain-google-genai>=2.1.9",
+    "langchain-anthropic>=0.3.18",
+    "langgraph-checkpoint-sqlite>=2.0.10",
+    "duckduckgo-search>=8.1.1",
+    "langchain-ollama>=0.3.6",
+]
+classifiers = [
+    "Operating System :: OS Independent",
+    "License :: OSI Approved :: BSD License",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+]
+[project.urls]
+Homepage = "https://github.com/lanl/ursa"
+Documentation = "https://github.com/lanl/ursa/tree/main/docs"
+Repository = "https://github.com/lanl/ursa"
+Issues = "https://github.com/lanl/ursa/issues"
+[build-system]
+requires = ["setuptools>=74.1", "setuptools-git-versioning>=2.0,<3"]
+build-backend = "setuptools.build_meta"
+[tool.setuptools-git-versioning]
+enabled = true
+[tool.ruff]
+line-length = 80
+[tool.ruff.lint]
+ignore = ["D100"]
+extend-select = ["I", "W505"] # "D"
+extend-unsafe-fixes = ["F401"]
+pydocstyle.convention = "numpy"
+pycodestyle.max-doc-length = 80
+# Ignore test file documentation linting.
+[tool.ruff.lint.extend-per-file-ignores]
+"tests/**/*.py" = ["D"]
+[dependency-groups]
+dev = [
+    "langgraph-checkpoint-sqlite>=2.0.10",
+    "notebook>=7.3.3",
+    "scikit-optimize>=0.10.2",
+]

ursa_ai-0.2.2/src/ursa/agents/__init__.py ADDED

@@ -0,0 +1,10 @@
+from .planning_agent          import     PlanningAgent,     PlanningState
+from .websearch_agent         import    WebSearchAgent,    WebSearchState
+from .execution_agent         import    ExecutionAgent,    ExecutionState
+from .code_review_agent       import   CodeReviewAgent,   CodeReviewState
+from .hypothesizer_agent      import HypothesizerAgent, HypothesizerState
+from .arxiv_agent             import ArxivAgent, PaperState, PaperMetadata
+from .recall_agent            import RecallAgent
+from .base                    import BaseAgent, BaseChatModel
+from .mp_agent                import MaterialsProjectAgent

ursa_ai-0.2.2/src/ursa/agents/arxiv_agent.py ADDED

@@ -0,0 +1,349 @@
+import os
+import pymupdf
+import requests
+import feedparser
+from PIL import Image
+from io import BytesIO
+import base64
+from urllib.parse import quote
+from typing_extensions import TypedDict, List
+from concurrent.futures import ThreadPoolExecutor, as_completed
+from tqdm import tqdm
+import statistics
+import re
+from langchain_community.document_loaders import PyPDFLoader
+from langchain_core.output_parsers import StrOutputParser
+from langchain_core.prompts import ChatPromptTemplate
+from langgraph.graph import StateGraph, END, START
+from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain_chroma import Chroma
+from .base import BaseAgent
+try:
+    from openai import OpenAI
+except:
+    pass
+# embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
+# embeddings = OpenAIEmbeddings()
+class PaperMetadata(TypedDict):
+    arxiv_id: str
+    full_text: str
+class PaperState(TypedDict, total=False):
+    query: str
+    context: str
+    papers: List[PaperMetadata]
+    summaries: List[str]
+    final_summary: str
+def describe_image(image: Image.Image) -> str:
+    if 'OpenAI' not in globals():
+        print("Vision transformer for summarizing images currently only implemented for OpenAI API.")
+        return ""
+    client = OpenAI()
+    buffered = BytesIO()
+    image.save(buffered, format="PNG")
+    img_base64 = base64.b64encode(buffered.getvalue()).decode()
+    response = client.chat.completions.create(
+        model="gpt-4-vision-preview",
+        messages=[
+            {"role": "system", "content": "You are a scientific assistant who explains plots and scientific diagrams."},
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": "Describe this scientific image or plot in detail."},
+                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_base64}"}}
+                ],
+            },
+        ],
+        max_tokens=500,
+    )
+    return response.choices[0].message.content.strip()
+def extract_and_describe_images(pdf_path: str, max_images: int = 5) -> List[str]:
+    doc = pymupdf.open(pdf_path)
+    descriptions = []
+    image_count = 0
+    for page_index in range(len(doc)):
+        if image_count >= max_images:
+            break
+        page = doc[page_index]
+        images = page.get_images(full=True)
+        for img_index, img in enumerate(images):
+            if image_count >= max_images:
+                break
+            xref = img[0]
+            base_image = doc.extract_image(xref)
+            image_bytes = base_image["image"]
+            image = Image.open(BytesIO(image_bytes))
+            try:
+                desc = describe_image(image)
+                descriptions.append(f"Page {page_index + 1}, Image {img_index + 1}: {desc}")
+            except Exception as e:
+                descriptions.append(f"Page {page_index + 1}, Image {img_index + 1}: [Error: {e}]")
+            image_count += 1
+    return descriptions
+def remove_surrogates(text: str) -> str:
+    return re.sub(r'[\ud800-\udfff]', '', text)
+class ArxivAgent(BaseAgent):
+    def __init__(self,
+                 llm="openai/o3-mini",
+                 summarize: bool = True,
+                 process_images = True,
+                 max_results: int = 3,
+                 download_papers: bool = True,
+                 rag_embedding         = None,
+                 database_path      ='arxiv_papers',
+                 summaries_path     ='arxiv_generated_summaries',
+                 vectorstore_path   ='arxiv_vectorstores',**kwargs):
+        super().__init__(llm, **kwargs)
+        self.summarize        = summarize
+        self.process_images   = process_images
+        self.max_results      = max_results
+        self.database_path    = database_path
+        self.summaries_path   = summaries_path
+        self.vectorstore_path = vectorstore_path
+        self.download_papers  = download_papers
+        self.rag_embedding    = rag_embedding
+        self.graph = self._build_graph()
+        os.makedirs(self.database_path, exist_ok=True)
+        os.makedirs(self.summaries_path, exist_ok=True)
+    def _fetch_papers(self, query: str) -> List[PaperMetadata]:
+        if self.download_papers:
+            encoded_query = quote(query)
+            url = f"http://export.arxiv.org/api/query?search_query=all:{encoded_query}&start=0&max_results={self.max_results}"
+            feed = feedparser.parse(url)
+            for i,entry in enumerate(feed.entries):
+                full_id = entry.id.split('/abs/')[-1]
+                arxiv_id = full_id.split('/')[-1]
+                title = entry.title.strip()
+                authors = ", ".join(author.name for author in entry.authors)
+                pdf_url = f"https://arxiv.org/pdf/{full_id}.pdf"
+                pdf_filename = os.path.join(self.database_path, f"{arxiv_id}.pdf")
+                if os.path.exists(pdf_filename):
+                    print(f"Paper # {i+1}, Title: {title}, already exists in database")
+                else:
+                    print(f"Downloading paper # {i+1}, Title: {title}")
+                    response = requests.get(pdf_url)
+                    with open(pdf_filename, 'wb') as f:
+                        f.write(response.content)
+        papers = []
+        pdf_files = [f for f in os.listdir(self.database_path) if f.lower().endswith(".pdf")]
+        for i,pdf_filename in enumerate(pdf_files):
+            full_text = ""
+            arxiv_id = pdf_filename.split('.pdf')[0]
+            vec_save_loc =  self.vectorstore_path + '/' + arxiv_id
+            if self.summarize and not os.path.exists(vec_save_loc):
+                try:
+                    loader = PyPDFLoader( os.path.join(self.database_path, pdf_filename) )
+                    pages = loader.load()
+                    full_text = "\n".join([p.page_content for p in pages])
+                    if self.process_images:
+                        image_descriptions = extract_and_describe_images( os.path.join(self.database_path, pdf_filename) )
+                        full_text += "\n\n[Image Interpretations]\n" + "\n".join(image_descriptions)
+                except Exception as e:
+                    full_text = f"Error loading paper: {e}"
+            papers.append({
+                "arxiv_id": arxiv_id,
+                "full_text": full_text,
+            })
+        return papers
+    def _fetch_node(self, state: PaperState) -> PaperState:
+        papers = self._fetch_papers(state["query"])
+        return {**state, "papers": papers}
+    def _get_or_build_vectorstore(self, paper_text: str, arxiv_id: str):
+        os.makedirs(self.vectorstore_path, exist_ok=True)
+        persist_directory = os.path.join(self.vectorstore_path, arxiv_id)
+        if os.path.exists(persist_directory):
+            vectorstore = Chroma(persist_directory=persist_directory, embedding_function=self.rag_embedding)
+        else:
+            splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
+            docs = splitter.create_documents([paper_text])
+            vectorstore = Chroma.from_documents(docs, self.rag_embedding, persist_directory=persist_directory)
+        return vectorstore.as_retriever(search_kwargs={"k": 5})
+    def _summarize_node(self, state: PaperState) -> PaperState:
+        prompt = ChatPromptTemplate.from_template("""
+        You are a scientific assistant responsible for summarizing extracts from research papers, in the context of the following task: {context}
+        Summarize the retrieved scientific content below.
+        {retrieved_content}
+        """)
+        chain = prompt | self.llm | StrOutputParser()
+        summaries = [None] * len(state["papers"])
+        relevancy_scores = [0.0] * len(state["papers"])
+        def process_paper(i, paper):
+            arxiv_id = paper["arxiv_id"]
+            summary_filename = os.path.join(self.summaries_path, f"{arxiv_id}_summary.txt")
+            try:
+                cleaned_text = remove_surrogates(paper["full_text"])
+                if self.rag_embedding:
+                    retriever = self._get_or_build_vectorstore(cleaned_text, arxiv_id)
+                    relevant_docs_with_scores = retriever.vectorstore.similarity_search_with_score(state["context"], k=5)
+                    if relevant_docs_with_scores:
+                        score = sum([s for _, s in relevant_docs_with_scores]) / len(relevant_docs_with_scores)
+                        relevancy_scores[i] = abs(1.0 - score)
+                    else:
+                        relevancy_scores[i] = 0.0
+                    retrieved_content = "\n\n".join([doc.page_content for doc, _ in relevant_docs_with_scores])
+                else:
+                    retrieved_content = cleaned_text
+                summary = chain.invoke({"retrieved_content": retrieved_content, "context": state["context"]})
+            except Exception as e:
+                summary = f"Error summarizing paper: {e}"
+                relevancy_scores[i] = 0.0
+            with open(summary_filename, "w") as f:
+                f.write(summary)
+            return i, summary
+        if ('papers' not in state or len(state['papers']) == 0):
+            print(f"No papers retrieved - bad query or network connection to ArXiv?")
+            return {**state, "summaries": None}
+        with ThreadPoolExecutor(max_workers=min(32, len(state["papers"]))) as executor:
+            futures = [executor.submit(process_paper, i, paper) for i, paper in enumerate(state["papers"])]
+            for future in tqdm(as_completed(futures), total=len(futures), desc="Summarizing Papers"):
+                i, result = future.result()
+                summaries[i] = result
+        if self.rag_embedding:
+            print(f"\nMax Relevancy Score: {max(relevancy_scores)}")
+            print(f"Min Relevancy Score: {min(relevancy_scores)}")
+            print(f"Median Relevancy Score: {statistics.median(relevancy_scores)}\n")
+        return {**state, "summaries": summaries}
+    def _aggregate_node(self, state: PaperState) -> PaperState:
+        summaries = state["summaries"]
+        papers = state["papers"]
+        formatted = []
+        if 'summaries' not in state or state['summaries'] is None or 'papers' not in state or state['papers'] is None:
+            return {**state, "final_summary": None}
+        for i, (paper, summary) in enumerate(zip(papers, summaries)):
+            citation = f"[{i+1}] Arxiv ID: {paper['arxiv_id']}"
+            formatted.append(f"{citation}\n\nSummary:\n{summary}")
+        combined = "\n\n" + ("\n\n" + "-" * 40 + "\n\n").join(formatted)
+        with open(self.summaries_path+'/summaries_combined.txt', "w") as f:
+            f.write(combined)
+        prompt = ChatPromptTemplate.from_template("""
+            You are a scientific assistant helping extract insights from summaries of research papers.
+            Here are the summaries of a large number of extracts from scientific papers:
+            {Summaries}
+            Your task is to read all the summaries and provide a response to this task: {context}
+            """)
+        chain = prompt | self.llm | StrOutputParser()
+        final_summary = chain.invoke({"Summaries": combined, "context":state["context"]})
+        with open(self.summaries_path+'/final_summary.txt', "w") as f:
+            f.write(final_summary)
+        return {**state, "final_summary": final_summary}
+    def _build_graph(self):
+        builder = StateGraph(PaperState)
+        builder.add_node("fetch_papers", self._fetch_node)
+        if self.summarize:
+            builder.add_node("summarize_each", self._summarize_node)
+            builder.add_node("aggregate", self._aggregate_node)
+            builder.set_entry_point("fetch_papers")
+            builder.add_edge("fetch_papers", "summarize_each")
+            builder.add_edge("summarize_each", "aggregate")
+            builder.set_finish_point("aggregate")
+        else:
+            builder.set_entry_point("fetch_papers")
+            builder.set_finish_point("fetch_papers")
+        graph = builder.compile()
+        return graph
+    def run(self, arxiv_search_query: str, context: str) -> str:
+        result = self.graph.invoke({"query": arxiv_search_query, "context":context})
+        if self.summarize:
+            return result.get("final_summary", "No summary generated.")
+        else:
+            return "\n\nFinished Fetching papers!"
+if __name__ == "__main__":
+    agent = ArxivAgent()
+    result = agent.run(arxiv_search_query="Experimental Constraints on neutron star radius",
+                       context="What are the constraints on the neutron star radius and what uncertainties are there on the constraints?")
+    print(result)
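
Reviewer note on `_fetch_papers`: the method turns each feed entry's `id` into a download URL and a local filename with two `split` calls. The block below isolates that string handling so it can be checked without network access. It is a sketch only: `split_arxiv_entry_id` is a hypothetical helper, not a function in ursa-ai, and the entry ids used with it are illustrative.

```python
import os


def split_arxiv_entry_id(entry_id: str, database_path: str = "arxiv_papers"):
    """Mirror the id handling in `_fetch_papers` for a single feed entry id."""
    # Everything after "/abs/" is the identifier used in the PDF URL.
    full_id = entry_id.split("/abs/")[-1]
    # Only the last path segment is used for the local filename.
    arxiv_id = full_id.split("/")[-1]
    pdf_url = f"https://arxiv.org/pdf/{full_id}.pdf"
    pdf_filename = os.path.join(database_path, f"{arxiv_id}.pdf")
    return full_id, arxiv_id, pdf_url, pdf_filename


# New-style identifier: full_id and arxiv_id are the same string.
print(split_arxiv_entry_id("http://arxiv.org/abs/2101.00001v1"))

# Old-style identifier with an archive prefix: the prefix is kept in the
# URL but dropped from the local filename.
print(split_arxiv_entry_id("http://arxiv.org/abs/hep-th/9901001v1"))
```

One consequence visible in the second call: because the archive prefix is dropped from the filename, two old-style papers from different archives that share a number would map to the same local PDF path.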
