PyPI - vectoriz - Versions diffs - 1.0.0__tar.gz → 1.0.1__tar.gz - Mend

vectoriz 1.0.0.tar.gz → 1.0.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (18)

{vectoriz-1.0.0 → vectoriz-1.0.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: vectoriz
-Version: 1.0.0
+Version: 1.0.1
 Summary: Python library for creating vectorized data from text or files.
 Home-page: https://github.com/PedroHenriqueDevBR/vectoriz
 Author: PedroHenriqueDevBR
@@ -25,19 +25,19 @@ Dynamic: summary
 # Vectoriz
-[![PyPI version](https://badge.fury.io/py/vectoriz.svg)](https://pypi.org/project/vectoriz/)
-[![GitHub license](https://img.shields.io/github/license/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/blob/main/LICENSE)
-[![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/downloads/)
-[![GitHub issues](https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/issues)
+Vectoriz is available on PyPI and can be installed via pip:
-[![GitHub stars](https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/stargazers)
+<div align="center">
+    <a href="https://pypi.org/project/vectoriz/"><img src="https://badge.fury.io/py/vectoriz.svg" alt="PyPI version"></a>
+    <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.12%2B-blue" alt="Python Version"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/issues"><img src="https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz" alt="GitHub issues"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/stargazers"><img src="https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz" alt="GitHub stars"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/network"><img src="https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz" alt="GitHub forks"></a>
+</div>
-[![GitHub forks](https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/network)
+## Project description
-Vectoriz is available on PyPI and can be installed via pip:
+To install other versions, see: <a href="https://pypi.org/project/vectoriz/#history">PyPI versions</a>
 ```bash
 pip install vectoriz

{vectoriz-1.0.0 → vectoriz-1.0.1}/README.md RENAMED

@@ -1,18 +1,18 @@
 # Vectoriz
-[![PyPI version](https://badge.fury.io/py/vectoriz.svg)](https://pypi.org/project/vectoriz/)
-[![GitHub license](https://img.shields.io/github/license/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/blob/main/LICENSE)
-[![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/downloads/)
-[![GitHub issues](https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/issues)
+Vectoriz is available on PyPI and can be installed via pip:
-[![GitHub stars](https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/stargazers)
+<div align="center">
+    <a href="https://pypi.org/project/vectoriz/"><img src="https://badge.fury.io/py/vectoriz.svg" alt="PyPI version"></a>
+    <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.12%2B-blue" alt="Python Version"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/issues"><img src="https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz" alt="GitHub issues"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/stargazers"><img src="https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz" alt="GitHub stars"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/network"><img src="https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz" alt="GitHub forks"></a>
+</div>
-[![GitHub forks](https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/network)
+## Project description
-Vectoriz is available on PyPI and can be installed via pip:
+To install other versions, see: <a href="https://pypi.org/project/vectoriz/#history">PyPI versions</a>
 ```bash
 pip install vectoriz

{vectoriz-1.0.0 → vectoriz-1.0.1}/setup.py RENAMED

@@ -2,7 +2,7 @@ from setuptools import setup, find_packages
 setup(
     name="vectoriz",
-    version="1.0.0",
+    version="1.0.1",
     author="PedroHenriqueDevBR",
     author_email="pedro.henrique.particular@gmail.com",
     description="Python library for creating vectorized data from text or files.",

{vectoriz-1.0.0 → vectoriz-1.0.1}/tests/test_files.py RENAMED

@@ -163,3 +163,107 @@ class TestFilesFeature:
         files_feature = FilesFeature()
         result = files_feature._extract_docx_content(str(tmp_path), "empty.docx")
         assert result == ""
+    def test_extract_markdown_content_reads_file_correctly(self, tmp_path):
+        test_content = "# Markdown Title\nThis is some markdown content."
+        test_file = tmp_path / "test.md"
+        test_file.write_text(test_content)
+        files_feature = FilesFeature()
+        result = files_feature._extract_markdown_content(str(tmp_path), "test.md")
+        assert result == test_content
+    def test_extract_markdown_content_with_unicode_chars(self, tmp_path):
+        test_content = "# Unicode Title\nContent with unicode: àáâãäåæç"
+        test_file = tmp_path / "unicode.md"
+        test_file.write_text(test_content, encoding="utf-8")
+        files_feature = FilesFeature()
+        result = files_feature._extract_markdown_content(str(tmp_path), "unicode.md")
+        assert result == test_content
+    def test_extract_markdown_content_raises_file_not_found(self):
+        files_feature = FilesFeature()
+        with pytest.raises(FileNotFoundError):
+            files_feature._extract_markdown_content(
+                "/non_existent_dir", "non_existent_file.md"
+            )
+    def test_extract_markdown_content_handles_empty_file(self, tmp_path):
+        test_file = tmp_path / "empty.md"
+        test_file.write_text("")
+        files_feature = FilesFeature()
+        result = files_feature._extract_markdown_content(str(tmp_path), "empty.md")
+        assert result == ""
+    def test_extract_markdown_content_raises_unicode_decode_error(self, tmp_path):
+        test_file = tmp_path / "invalid_encoding.md"
+        test_file.write_bytes(b"\x80\x81\x82")  # Invalid UTF-8 bytes
+        files_feature = FilesFeature()
+        with pytest.raises(UnicodeDecodeError):
+            files_feature._extract_markdown_content(str(tmp_path), "invalid_encoding.md")
+    def test_load_markdown_files_from_directory_loads_files_correctly(self, tmp_path):
+        test_content_1 = "# Title 1\nContent 1"
+        test_content_2 = "# Title 2\nContent 2"
+        test_file_1 = tmp_path / "file1.md"
+        test_file_2 = tmp_path / "file2.md"
+        test_file_1.write_text(test_content_1)
+        test_file_2.write_text(test_content_2)
+        files_feature = FilesFeature()
+        result = files_feature.load_markdown_files_from_directory(str(tmp_path))
+        assert len(result.chunk_names) == 2
+        assert len(result.text_list) == 2
+        assert test_file_1.name in result.chunk_names
+        assert test_file_2.name in result.chunk_names
+        assert test_content_1 in result.text_list
+        assert test_content_2 in result.text_list
+    def test_load_markdown_files_from_directory_skips_non_markdown_files(self, tmp_path):
+        test_content_md = "# Markdown Content"
+        test_content_txt = "Text Content"
+        test_file_md = tmp_path / "file.md"
+        test_file_txt = tmp_path / "file.txt"
+        test_file_md.write_text(test_content_md)
+        test_file_txt.write_text(test_content_txt)
+        files_feature = FilesFeature()
+        result = files_feature.load_markdown_files_from_directory(str(tmp_path))
+        assert len(result.chunk_names) == 1
+        assert len(result.text_list) == 1
+        assert test_file_md.name in result.chunk_names
+        assert test_content_md in result.text_list
+        assert test_file_txt.name not in result.chunk_names
+    def test_load_markdown_files_from_directory_handles_empty_directory(self, tmp_path):
+        files_feature = FilesFeature()
+        result = files_feature.load_markdown_files_from_directory(str(tmp_path))
+        assert len(result.chunk_names) == 0
+        assert len(result.text_list) == 0
+    def test_load_markdown_files_from_directory_handles_empty_markdown_file(self, tmp_path):
+        test_file = tmp_path / "empty.md"
+        test_file.write_text("")
+        files_feature = FilesFeature()
+        result = files_feature.load_markdown_files_from_directory(str(tmp_path))
+        assert len(result.chunk_names) == 1
+        assert len(result.text_list) == 1
+        assert test_file.name in result.chunk_names
+        assert result.text_list[0] == ""
+    def test_load_markdown_files_from_directory_with_verbose_output(self, tmp_path, capsys):
+        test_content = "# Markdown Content"
+        test_file = tmp_path / "file.md"
+        test_file.write_text(test_content)
+        files_feature = FilesFeature()
+        files_feature.load_markdown_files_from_directory(str(tmp_path), verbose=True)
+        captured = capsys.readouterr()
+        assert "Loaded Markdown file: file.md" in captured.out
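
The edge cases exercised by the new Markdown tests follow from how Python's built-in `open` behaves with `encoding="utf-8"`. The sketch below is a standalone illustration, not the library's code: `read_markdown` and `probe_edge_cases` are hypothetical names, and `read_markdown` only mirrors the read logic visible in this diff.

```python
import os
import tempfile


def read_markdown(directory: str, file: str) -> str:
    """Hypothetical stand-in mirroring the read logic shown in the diff."""
    with open(os.path.join(directory, file), "r", encoding="utf-8") as fl:
        return fl.read()


def probe_edge_cases() -> dict[str, str]:
    """Record what each edge case produces, keyed by a short label."""
    outcomes: dict[str, str] = {}
    with tempfile.TemporaryDirectory() as tmp:
        # An empty file yields an empty string, not None.
        with open(os.path.join(tmp, "empty.md"), "w", encoding="utf-8"):
            pass
        outcomes["empty"] = repr(read_markdown(tmp, "empty.md"))

        # Bytes that are not valid UTF-8 raise UnicodeDecodeError on read.
        with open(os.path.join(tmp, "invalid.md"), "wb") as fl:
            fl.write(b"\x80\x81\x82")
        try:
            read_markdown(tmp, "invalid.md")
            outcomes["invalid"] = "no error"
        except UnicodeDecodeError:
            outcomes["invalid"] = "UnicodeDecodeError"

        # A missing file raises FileNotFoundError when opened.
        try:
            read_markdown(tmp, "missing.md")
            outcomes["missing"] = "no error"
        except FileNotFoundError:
            outcomes["missing"] = "FileNotFoundError"
    return outcomes


if __name__ == "__main__":
    print(probe_edge_cases())
```

Because an empty file reads back as `""`, a caller that checks for `None` will not filter out empty Markdown files, which matches the test that expects one entry with empty text.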

{vectoriz-1.0.0 → vectoriz-1.0.1}/tests/test_token_transformer.py RENAMED

@@ -159,6 +159,21 @@ class TestTokenTransformer:
         assert len(result1.strip().split("\n")) == 1
         assert len(result3.strip().split("\n")) == 3
         assert len(result5.strip().split("\n")) == 5
+    def test_search_with_list_arg(self):
+        transformer = TokenTransformer()
+        texts = ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5"]
+        embeddings = transformer.text_to_embeddings(texts)
+        index = transformer.embeddings_to_index(embeddings)
+        result1 = transformer.search("Doc", index, texts, context_amount=1, as_list=True)
+        result3 = transformer.search("Doc", index, texts, context_amount=3, as_list=True)
+        result5 = transformer.search("Doc", index, texts, context_amount=5, as_list=True)
+        assert len(result1) == 1
+        assert len(result3) == 3
+        assert len(result5) == 5
     def test_create_index(self):
         transformer = TokenTransformer()

{vectoriz-1.0.0 → vectoriz-1.0.1}/vectoriz/files.py RENAMED

@@ -101,6 +101,36 @@ class FilesFeature:
         with open(os.path.join(directory, file), "r", encoding="utf-8") as fl:
             text = fl.read()
             return text
+    def _extract_markdown_content(self, directory: str, file: str) -> Optional[str]:
+        """
+        Extract the content of a Markdown file.
+        This method opens a Markdown file in read mode with UTF-8 encoding, reads its content,
+        and returns it as a string.
+        Parameters:
+        ----------
+        directory : str
+            The directory path where the file is located.
+        file : str
+            The name of the Markdown file to read.
+        Returns:
+        -------
+        Optional[str]
+            The content of the Markdown file (an empty string if the file is empty).
+        Raises:
+        ------
+        FileNotFoundError
+            If the specified file does not exist.
+        UnicodeDecodeError
+            If the file cannot be decoded using UTF-8 encoding.
+        """
+        with open(os.path.join(directory, file), "r", encoding="utf-8") as fl:
+            text = fl.read()
+            return text
     def _extract_docx_content(self, directory: str, file: str) -> Optional[str]:
         """
@@ -195,6 +225,41 @@ class FilesFeature:
             if verbose:
                 print(f"Loaded Word file: {file}")
         return argument
+    def load_markdown_files_from_directory(self, directory: str, verbose: bool = False) -> FileArgument:
+        """
+        Load all Markdown (.md) files from the specified directory and extract their content.
+        This method iterates through all files in the given directory, identifies those
+        with a .md extension, and processes them using the _extract_markdown_content method.
+        Args:
+            directory (str): Path to the directory containing Markdown files to be processed
+            verbose (bool): If True, print a message for each file loaded or skipped
+        Returns:
+            FileArgument: The names and text content of the loaded Markdown files
+        Examples:
+            >>> files_feature = FilesFeature()
+            >>> argument = files_feature.load_markdown_files_from_directory("/path/to/documents")
+        """
+        argument: FileArgument = FileArgument([], [], [])
+        for file in os.listdir(directory):
+            if not file.endswith(".md"):
+                if verbose:
+                    print(f"Error file: {file}")
+                continue
+            text = self._extract_markdown_content(directory, file)
+            if text is None:
+                if verbose:
+                    print(f"Error file: {file}")
+                continue
+            argument.add_data(file, text)
+            if verbose:
+                print(f"Loaded Markdown file: {file}")
+        return argument
     def load_all_files_from_directory(self, directory: str, verbose: bool =  False) -> FileArgument:
         """

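The directory loader added in `vectoriz/files.py` can be approximated without the library as follows. This is a minimal sketch under stated assumptions: `LoadedFiles` is a hypothetical stand-in for `FileArgument` that models only the `chunk_names` and `text_list` fields seen in the tests, `load_markdown_dir` and `demo` are illustrative names, and `sorted()` is added here for reproducible output (the diff iterates `os.listdir` directly, whose order is arbitrary).

```python
import os
import tempfile
from dataclasses import dataclass, field


@dataclass
class LoadedFiles:
    """Hypothetical stand-in for FileArgument; only two fields are modelled."""
    chunk_names: list[str] = field(default_factory=list)
    text_list: list[str] = field(default_factory=list)


def load_markdown_dir(directory: str, verbose: bool = False) -> LoadedFiles:
    """Collect the name and text of every .md file directly inside `directory`."""
    loaded = LoadedFiles()
    # sorted() is an addition for determinism; it is not in the diff.
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".md"):
            if verbose:
                print(f"Skipped: {name}")
            continue
        with open(os.path.join(directory, name), "r", encoding="utf-8") as fl:
            text = fl.read()
        loaded.chunk_names.append(name)
        loaded.text_list.append(text)
        if verbose:
            print(f"Loaded Markdown file: {name}")
    return loaded


def demo() -> LoadedFiles:
    """Build a throwaway directory with two .md files and one .txt file."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, text in [("a.md", "# A"), ("b.md", ""), ("notes.txt", "plain")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as fl:
                fl.write(text)
        return load_markdown_dir(tmp)


if __name__ == "__main__":
    print(demo())
```

The `.txt` file is skipped and the empty `.md` file is kept with empty text, which is the behaviour the new tests assert for the real method.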
{vectoriz-1.0.0 → vectoriz-1.0.1}/vectoriz/token_transformer.py RENAMED

@@ -1,6 +1,6 @@
 import faiss
 import numpy as np
-from typing import Self
+from typing import Self, Union
 from sentence_transformers import SentenceTransformer
@@ -76,7 +76,8 @@ class TokenTransformer:
         index: faiss.IndexFlatL2,
         texts: list[str],
         context_amount: int = 1,
-    ) -> str:
+        as_list: bool = False,
+    ) -> Union[str, list[str]]:
         """
         Searches for the most similar texts to the given query using the provided FAISS index.
         This method converts the query into an embedding, searches for the k nearest neighbors
@@ -97,6 +98,9 @@ class TokenTransformer:
         _, I = index.search(query_embedding, k=context_amount)
         context = ""
+        if as_list:
+            return [texts[i].strip() for i in I[0]]
         for i in I[0]:
             context += texts[i] + "\n"
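
The effect of the new `as_list` flag can be illustrated without FAISS or sentence-transformers. The sketch below is an assumption-laden stand-in, not the library's implementation: `squared_l2` and this free-standing `search` are hypothetical, the brute-force ranking replaces `index.search(...)`, and the string branch simply returns the accumulated text because the hunk does not show what follows the loop.

```python
from typing import Union


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(
    query_vec: list[float],
    vectors: list[list[float]],
    texts: list[str],
    context_amount: int = 1,
    as_list: bool = False,
) -> Union[str, list[str]]:
    # Brute-force nearest neighbours; stands in for index.search(...) in the diff.
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query_vec, vectors[i]))
    nearest = ranked[:context_amount]
    if as_list:
        # List mode: one stripped string per neighbour.
        return [texts[i].strip() for i in nearest]
    # String mode: neighbours joined with newlines, as in the original loop.
    context = ""
    for i in nearest:
        context += texts[i] + "\n"
    return context


if __name__ == "__main__":
    texts = ["Doc 1 ", "Doc 2", "Doc 3"]
    vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
    print(search([0.1, 0.0], vectors, texts, context_amount=2, as_list=True))
    print(repr(search([0.1, 0.0], vectors, texts, context_amount=2)))
```

Since `as_list` defaults to `False`, existing callers keep receiving a single string; only callers that opt in get a list, which avoids splitting on newlines afterwards.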

{vectoriz-1.0.0 → vectoriz-1.0.1}/vectoriz.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: vectoriz
-Version: 1.0.0
+Version: 1.0.1
 Summary: Python library for creating vectorized data from text or files.
 Home-page: https://github.com/PedroHenriqueDevBR/vectoriz
 Author: PedroHenriqueDevBR
@@ -25,19 +25,19 @@ Dynamic: summary
 # Vectoriz
-[![PyPI version](https://badge.fury.io/py/vectoriz.svg)](https://pypi.org/project/vectoriz/)
-[![GitHub license](https://img.shields.io/github/license/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/blob/main/LICENSE)
-[![Python Version](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/downloads/)
-[![GitHub issues](https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/issues)
+Vectoriz is available on PyPI and can be installed via pip:
-[![GitHub stars](https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/stargazers)
+<div align="center">
+    <a href="https://pypi.org/project/vectoriz/"><img src="https://badge.fury.io/py/vectoriz.svg" alt="PyPI version"></a>
+    <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.12%2B-blue" alt="Python Version"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/issues"><img src="https://img.shields.io/github/issues/PedroHenriqueDevBR/vectoriz" alt="GitHub issues"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/stargazers"><img src="https://img.shields.io/github/stars/PedroHenriqueDevBR/vectoriz" alt="GitHub stars"></a>
+    <a href="https://github.com/PedroHenriqueDevBR/vectoriz/network"><img src="https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz" alt="GitHub forks"></a>
+</div>
-[![GitHub forks](https://img.shields.io/github/forks/PedroHenriqueDevBR/vectoriz)](https://github.com/PedroHenriqueDevBR/vectoriz/network)
+## Project description
-Vectoriz is available on PyPI and can be installed via pip:
+To install other versions, see: <a href="https://pypi.org/project/vectoriz/#history">PyPI versions</a>
 ```bash
 pip install vectoriz