PyPI - docp - Versions diffs - 0.1.0b1__tar.gz → 0.2.0__tar.gz - Mend

docp 0.1.0b1__tar.gz → 0.2.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (293)

docp-0.2.0/.gitignore ADDED

@@ -0,0 +1,6 @@
+*__pycache__*
+*.pyc
+# Ignore the development model caches.
+docp/.cache/*

docp-0.2.0/.readthedocs.yaml ADDED

@@ -0,0 +1,24 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+# Required
+version: 2
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-lts-latest
+  tools:
+    python: "3.12"
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  builder: html
+  configuration: docs/source/conf.py
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+   install:
+   - requirements: docs/requirements.txt

docp-0.2.0/PKG-INFO ADDED

@@ -0,0 +1,110 @@
+Metadata-Version: 2.2
+Name: docp
+Version: 0.2.0
+Summary: A basic document parsing and loading utility.
+Author-email: The Developers <development@s3dev.uk>
+License: GNU GPL-3
+Project-URL: Documentation, https://docp.readthedocs.io
+Project-URL: Homepage, https://github.com/s3dev/docp
+Project-URL: Repository, https://github.com/s3dev/docp
+Keywords: document,library,parsing,utility,utilities
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Intended Audience :: End Users/Desktop
+Classifier: Programming Language :: Python :: 3.7
+Classifier: Programming Language :: Python :: 3.8
+Classifier: Programming Language :: Python :: 3.9
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.10
+Classifier: Programming Language :: Python :: 3.12
+Classifier: Programming Language :: Python :: Implementation :: CPython
+Classifier: Operating System :: POSIX :: Linux
+Classifier: Operating System :: Microsoft :: Windows
+Classifier: Topic :: Software Development
+Classifier: Topic :: Software Development :: Libraries
+Classifier: Topic :: Utilities
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas
+Requires-Dist: unidecode
+Requires-Dist: utils4
+# A basic document parsing and loading utility.
+[![PyPI - Version](https://img.shields.io/pypi/v/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Implementation](https://img.shields.io/pypi/implementation/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Status](https://img.shields.io/pypi/status/docp?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/tests-pending-orange?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/code_coverage-pending-orange?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/pylint_analysis-100%25-brightgreen?style=flat-square)](https://pypi.org/project/docp)
+[![Documentation Status](https://readthedocs.org/projects/docp/badge/?version=latest&style=flat-square)](https://docp.readthedocs.io/en/latest/)
+[![PyPI - License](https://img.shields.io/pypi/l/docp?style=flat-square)](https://opensource.org/license/gpl-3-0)
+[![PyPI - Wheel](https://img.shields.io/pypi/wheel/docp?style=flat-square)](https://pypi.org/project/docp)
+In its simplest form, the ``docp`` project is a (doc)ument \(p\)arsing library.
+Written in CPython, the project wraps various lower-level libraries, helping to consolidate binary document structure parsing functionality into a single library. Additional functionality includes [document loaders](#loaders) which load a parsed document's embeddings into a Chroma vector database, for RAG-enabled LLM use.
+## Installation
+The easiest way to install ``docp`` is using ``pip`` *after* activating your virtual environment:
+    pip install docp
+Additional (older) releases can be found either at [PyPI](https://pypi.org/project/docp/#history) or in [GitHub Releases](https://github.com/s3dev/docp/releases).
+### A note on the installation of dependencies:
+To keep the installation dependencies to a minimum, only core libraries are required for installation. This means the parser-specific and loader libraries are *not* installed automatically as part of the ``pip install`` command.
+If a parser is imported and a library is required but not installed, you'll be notified with an easy-to-read message, listing the required dependenc(y|ies).
+The rationale behind this design decision is that not all users will need the document *loading* capability, so ``torch``, ``langchain``,  etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as ``torch``, nor make your project dependent on it.
+## The Toolset
+### Parsers
+As of this release, parsers for the following binary document types are supported:
+- PDF
+- MS PowerPoint (PPTX)
+- (more coming soon)
+### Loaders
+In addition to document parsing, document *loading* functionality is built-in as well. Specifically, loading documents into a [Chroma](https://www.trychroma.com) vector database for RAG-enabled LLM ingestion.
+For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ``ChromaLoader`` class is specifically designed for this. A single call to the class' loader method results in file retrieval, parsing, splitting, embedding and storage.
+For further detail and usage examples, please refer to the project's [documentation](https://docp.readthedocs.io/).
+## Using the Library
+The documentation suite contains detailed explanations and example usage for each of the library's importable modules. For detailed documentation, usage examples and links to the source code itself, please refer to the
+[Library API](https://docp.readthedocs.io/en/latest/library.html) page in the documentation.
+### Quickstart
+For convenience, here are a couple of examples of how to parse the supported document types.
+**Extract text from a PDF file:**
+    >>> from docp import PDFParser
+    >>> pdf = PDFParser(path='/path/to/myfile.pdf')
+    >>> pdf.extract_text()
+    # Access the content of page 1.
+    >>> pg1 = pdf.doc.pages[1].content
+**Extract text from a PowerPoint presentation:**
+    >>> from docp import PPTXParser
+    >>> pptx = PPTXParser(path='/path/to/myfile.pptx')
+    >>> pptx.extract_text()
+    # Access the text on slide 1.
+    >>> pg1 = pptx.doc.slides[1].content

docp-0.2.0/README.md ADDED

@@ -0,0 +1,78 @@
+# A basic document parsing and loading utility.
+[![PyPI - Version](https://img.shields.io/pypi/v/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Implementation](https://img.shields.io/pypi/implementation/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/docp?style=flat-square)](https://pypi.org/project/docp)
+[![PyPI - Status](https://img.shields.io/pypi/status/docp?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/tests-pending-orange?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/code_coverage-pending-orange?style=flat-square)](https://pypi.org/project/docp)
+[![Static Badge](https://img.shields.io/badge/pylint_analysis-100%25-brightgreen?style=flat-square)](https://pypi.org/project/docp)
+[![Documentation Status](https://readthedocs.org/projects/docp/badge/?version=latest&style=flat-square)](https://docp.readthedocs.io/en/latest/)
+[![PyPI - License](https://img.shields.io/pypi/l/docp?style=flat-square)](https://opensource.org/license/gpl-3-0)
+[![PyPI - Wheel](https://img.shields.io/pypi/wheel/docp?style=flat-square)](https://pypi.org/project/docp)
+In its simplest form, the ``docp`` project is a (doc)ument \(p\)arsing library.
+Written in CPython, the project wraps various lower-level libraries, helping to consolidate binary document structure parsing functionality into a single library. Additional functionality includes [document loaders](#loaders) which load a parsed document's embeddings into a Chroma vector database, for RAG-enabled LLM use.
+## Installation
+The easiest way to install ``docp`` is using ``pip`` *after* activating your virtual environment:
+    pip install docp
+Additional (older) releases can be found either at [PyPI](https://pypi.org/project/docp/#history) or in [GitHub Releases](https://github.com/s3dev/docp/releases).
+### A note on the installation of dependencies:
+To keep the installation dependencies to a minimum, only core libraries are required for installation. This means the parser-specific and loader libraries are *not* installed automatically as part of the ``pip install`` command.
+If a parser is imported and a library is required but not installed, you'll be notified with an easy-to-read message, listing the required dependenc(y|ies).
+The rationale behind this design decision is that not all users will need the document *loading* capability, so ``torch``, ``langchain``,  etc. should not be installed automatically. For example, if your project requires a simple PDF parser, you don't need to (and likely don't want to) 'clutter' your environment with something as heavy as ``torch``, nor make your project dependent on it.
+## The Toolset
+### Parsers
+As of this release, parsers for the following binary document types are supported:
+- PDF
+- MS PowerPoint (PPTX)
+- (more coming soon)
+### Loaders
+In addition to document parsing, document *loading* functionality is built-in as well. Specifically, loading documents into a [Chroma](https://www.trychroma.com) vector database for RAG-enabled LLM ingestion.
+For example, you may wish to load a series of PDF files into a vector database which serves as the backend for a RAG-enabled LLM chatbot. The ``ChromaLoader`` class is specifically designed for this. A single call to the class' loader method results in file retrieval, parsing, splitting, embedding and storage.
+For further detail and usage examples, please refer to the project's [documentation](https://docp.readthedocs.io/).
+## Using the Library
+The documentation suite contains detailed explanations and example usage for each of the library's importable modules. For detailed documentation, usage examples and links to the source code itself, please refer to the
+[Library API](https://docp.readthedocs.io/en/latest/library.html) page in the documentation.
+### Quickstart
+For convenience, here are a couple of examples of how to parse the supported document types.
+**Extract text from a PDF file:**
+    >>> from docp import PDFParser
+    >>> pdf = PDFParser(path='/path/to/myfile.pdf')
+    >>> pdf.extract_text()
+    # Access the content of page 1.
+    >>> pg1 = pdf.doc.pages[1].content
+**Extract text from a PowerPoint presentation:**
+    >>> from docp import PPTXParser
+    >>> pptx = PPTXParser(path='/path/to/myfile.pptx')
+    >>> pptx.extract_text()
+    # Access the text on slide 1.
+    >>> pg1 = pptx.doc.slides[1].content

docp-0.2.0/docp/.cache/.locks/models--sentence-transformers--all-MiniLM-L6-v2/cb202bfe2e3c98645018a6d12f182a434c9d3e02.lock ADDED

File without changes

docp-0.2.0/docp/.cache/.locks/models--sentence-transformers--all-MiniLM-L6-v2/d1514c3162bbe87b343f565fadc62e6c06f04f03.lock ADDED

File without changes

docp-0.2.0/docp/.cache/.locks/models--sentence-transformers--all-MiniLM-L6-v2/e7b0375001f109a6b8873d756ad4f7bbb15fbaa5.lock ADDED

File without changes

docp-0.2.0/docp/.cache/.locks/models--sentence-transformers--all-MiniLM-L6-v2/fb140275c155a9c7c5a3b3e0e77a9e839594a938.lock ADDED

File without changes

docp-0.2.0/docp/.cache/.locks/models--sentence-transformers--all-MiniLM-L6-v2/fd1b291129c607e5d49799f87cb219b27f98acdf.lock ADDED

File without changes

docp-0.2.0/docp/__init__.py ADDED

@@ -0,0 +1,40 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+:Purpose:   This module provides the project initialisation logic.
+:Platform:  Linux/Windows | Python 3.10+
+:Developer: J Berendt
+:Email:     development@s3dev.uk
+:Comments:  The loader modules/classes have *not* been imported due to the
+            heavy dependency requirements. Refer to the loaders/__init__.py
+            module instead.
+"""
+import os
+import sys
+sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)))
+from utils4.user_interface import ui
+# locals
+from .libs._version import __version__
+# TODO: Change these to use logging.
+# Bring entry-points to the surface.
+try:
+    from .parsers.pdfparser import PDFParser
+except ImportError as err:
+    msg = ( 'An error occurred while importing the PDF parser:\n'
+           f'- {err}\n'
+            '  - This can be ignored if the parser is not in use.\n')
+    ui.print_warning(f'\n[ImportError]: {msg}')
+try:
+    from .parsers.pptxparser import PPTXParser
+except ImportError as err:
+    msg = ( 'An error occurred while importing the PPTX parser:\n'
+           f'- {err}\n'
+            '  - This can be ignored if the parser is not in use.\n')
+    ui.print_warning(f'\n[ImportError]: {msg}')

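The `TODO` in `docp/__init__.py` above notes that these import warnings should eventually use logging. A minimal sketch of how that might look with the standard library only is given below. The helper name `import_optional` and the logger name `docp.sketch` are hypothetical and are not part of the package.

```python
import importlib
import logging

# Hypothetical logger name, used for illustration only.
logger = logging.getLogger('docp.sketch')


def import_optional(module: str, label: str):
    """Import ``module``; log a warning and return None if it is unavailable."""
    try:
        return importlib.import_module(module)
    except ImportError as err:
        # Same message shape as the ui.print_warning calls in the diff above.
        logger.warning('An error occurred while importing the %s: %s. '
                       'This can be ignored if it is not in use.', label, err)
        return None


# A module that exists, and one that (presumably) does not.
json_mod = import_optional('json', 'JSON module')
missing = import_optional('not_a_real_module_xyz', 'example parser')
```

Returning `None` rather than re-raising keeps the package importable when an optional parser's dependencies are absent, which is the behaviour the `try`/`except ImportError` blocks in the diff aim for.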
docp-0.2.0/docp/dbs/__init__.py ADDED

File without changes

{docp-0.1.0b1 → docp-0.2.0}/docp/dbs/chroma.py RENAMED

@@ -10,11 +10,18 @@
 :Developer: J Berendt
 :Email:     development@s3dev.uk
-:Comments:  n/a
+:Comments:  This module uses the
+            ``langchain_community.vectorstores.Chroma`` wrapper class,
+            rather than the base ``chromadb`` library, as it provides the
+            ``add_texts`` method, which supports GPU processing and
+            parallelisation and is wrapped by this module's
+            :meth:`~ChromaDB.add_documents` method.
 """
+# pylint: disable=import-error
 # pylint: disable=wrong-import-order
+from __future__ import annotations
 import chromadb
 import os
 import torch
@@ -81,19 +88,25 @@ class ChromaDB(_Chroma):
         """Accessor to the database's path."""
         return self._path
-    def add_documents(self, docs: list):
+    def add_documents(self, docs: list[langchain_core.documents.base.Document]):  # noqa  # pylint: disable=undefined-variable
         """Add multiple documents to the collection.
-        This method wraps ``Chroma.add_texts`` method which supports GPU
-        processing and parallelisation. The ID is derived locally from
-        the file's basename, page number and page content.
+        This method overrides the base class' ``add_documents`` method
+        to enable local ID derivation. Knowing *how* the IDs are derived
+        gives us greater understanding and querying ability of the
+        documents in the database. Each ID is derived locally by the
+        :meth:`_preproc` method from the file's basename, page number
+        and page content.
+        Additionally, this method wraps the
+        :func:`langchain_community.vectorstores.Chroma.add_texts`
+        method which supports GPU processing and parallelisation.
         Args:
             docs (list): A list of ``langchain_core.documents.base.Document``
                 document objects.
         """
-        # This method overrides the base class' add_documents method.
         # pylint: disable=arguments-differ
         # pylint: disable=arguments-renamed
         if not isinstance(docs, list):

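The `add_documents` docstring above says each ID is derived locally from the file's basename, page number and page content, but the diff does not show how `_preproc` does this. The sketch below is one way such a deterministic ID could be built, assuming a SHA-256 hash over the three fields. The function name `derive_id`, the `|` separator and the choice of hash are all assumptions, not the package's implementation.

```python
import hashlib
import os


def derive_id(source: str, page: int, content: str) -> str:
    """Derive a repeatable ID from basename, page number and page content.

    Hypothetical helper: the hash and field layout are illustrative only.
    """
    basename = os.path.basename(source)
    key = f'{basename}|{page}|{content}'.encode('utf-8')
    return hashlib.sha256(key).hexdigest()


# Same basename, page and content give the same ID, regardless of directory.
id_a = derive_id('/data/report.pdf', 1, 'Hello world')
id_b = derive_id('/other/report.pdf', 1, 'Hello world')
# A different page number gives a different ID.
id_c = derive_id('/data/report.pdf', 2, 'Hello world')
```

Because the ID depends only on known inputs, a caller can recompute it later to look up or de-duplicate a specific page, which matches the docstring's point about understanding how the IDs are derived.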
docp-0.2.0/docp/libs/_version.py ADDED

@@ -0,0 +1 @@
+__version__ = '0.2.0'

docp-0.2.0/docp/libs/changelog.py ADDED

@@ -0,0 +1,7 @@
+# Changed.
+# ENABLE SPHINX TO ACCESS THE GIT LOG
+"""
+.. git_changelog::
+    :revisions: 99
+    :detailed-message-pre: True
+"""

docp-0.2.0/docp/libs/utilities.py ADDED

@@ -0,0 +1,107 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+:Purpose:   This module provides utility-based functionality for the
+            project.
+:Platform:  Linux/Windows | Python 3.10+
+:Developer: J Berendt
+:Email:     development@s3dev.uk
+:Comments:  n/a
+"""
+import os
+import sys
+sys.path.insert(0, os.path.join(os.path.dirname(os.path.realpath(__file__)), '../../'))
+import re
+from glob import glob
+from utils4 import futils
+class Utilities:
+    """General (cross-project) utility functions."""
+    @staticmethod
+    def collect_files(path: str, ext: str, recursive: bool) -> list:
+        """Collect all files for a given extension from a path.
+        Args:
+            path (str): Full path serving as the root for the search.
+            ext (str, optional): If the ``path`` argument refers to a
+                *directory*, a specific file extension can be specified
+                here. For example: ``ext = 'pdf'``.
+                If anything other than ``'**'`` is provided, all
+                alpha-characters are parsed from the string, and prefixed
+                with ``*.``. Meaning, if ``'.pdf'`` is passed, the
+                characters ``'pdf'`` are parsed and prefixed with ``*.``
+                to create ``'*.pdf'``. However, if ``'things.foo'`` is
+                passed, the derived extension will be ``'*.thingsfoo'``.
+                Defaults to '**', for a recursive search.
+            recursive (bool): Instruct the search to recurse into
+                sub-directories.
+        Returns:
+            list: The list of full file paths returned by the ``glob``
+            call. Any directory-only paths are removed.
+        """
+        if ext != '**':
+            ext = f'*.{"".join(re.findall("[a-zA-Z]+", ext))}'
+        return list(filter(os.path.isfile, glob(os.path.join(path, ext), recursive=recursive)))
+    # !!!: Replace this with utils4.futils when available.
+    @staticmethod
+    def ispdf(path: str) -> bool:
+        """Test the file signature. Verify this is a valid PDF file.
+        Args:
+            path (str): Path to the file being tested.
+        Returns:
+            bool: True if this is a valid PDF file, otherwise False.
+        """
+        with open(path, 'rb') as f:
+            sig = f.read(5)
+        return sig == b'\x25\x50\x44\x46\x2d'
+    @staticmethod
+    def iszip(path: str) -> bool:
+        """Test the file signature. Verify this is a valid ZIP archive.
+        Args:
+            path (str): Path to the file being tested.
+        Returns:
+            bool: True if this is a valid ZIP archive, otherwise False.
+        """
+        return futils.iszip(path)
+    @staticmethod
+    def parse_to_keywords(resp: str) -> str:
+        """Parse the bot's response into a comma-separated keyword string.
+        Args:
+            resp (str): Text response directly from the bot.
+        Returns:
+            str: The keywords extracted from the response's asterisk
+            bullet points or numbered list, joined by ``', '``. An empty
+            string is returned if no keywords are found.
+        """
+        # Capture asterisk bullet points or a numbered list.
+        rexp = re.compile(r'(?:\*|[0-9]+\.)\s*(.*)\n')
+        trans = {45: ' ', 47: ' '}
+        resp_ = resp.translate(trans).lower()
+        kwds = rexp.findall(resp_)
+        if kwds:
+            return ', '.join(kwds)
+        return ''
+utilities = Utilities()

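The helpers in `utilities.py` above can be exercised in isolation. The sketch below re-creates three of them as standalone functions: the extension-to-glob-pattern step as the `collect_files` docstring describes it (all alphabetic runs joined), the five-byte PDF signature test, and the keyword regex. The function names `glob_pattern`, `has_pdf_signature` and `keywords` are hypothetical stand-ins, not the package's API.

```python
import os
import re
import tempfile

# The five-byte signature compared in ``ispdf``; b'\x25\x50\x44\x46\x2d' spells b'%PDF-'.
PDF_SIGNATURE = b'%PDF-'


def glob_pattern(ext: str) -> str:
    """Build a glob pattern from an extension, as the docstring describes."""
    if ext == '**':
        return ext
    # Join every run of ASCII letters, then prefix with '*.'.
    return '*.' + ''.join(re.findall('[a-zA-Z]+', ext))


def has_pdf_signature(path: str) -> bool:
    """Return True if the file starts with the PDF signature bytes."""
    with open(path, 'rb') as f:
        return f.read(len(PDF_SIGNATURE)) == PDF_SIGNATURE


def keywords(resp: str) -> str:
    """Join the bullet or numbered-list items of ``resp`` with ', '."""
    rexp = re.compile(r'(?:\*|[0-9]+\.)\s*(.*)\n')
    # Code points 45 ('-') and 47 ('/') are replaced by spaces.
    cleaned = resp.translate({45: ' ', 47: ' '}).lower()
    return ', '.join(rexp.findall(cleaned))


# Write a throwaway file carrying a PDF header, then test it.
with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
    tmp.write(b'%PDF-1.7\n')
is_pdf = has_pdf_signature(tmp.name)
os.remove(tmp.name)

pattern_a = glob_pattern('.pdf')
pattern_b = glob_pattern('things.foo')
kwds = keywords('* Foo-bar\n2. Baz/qux\n')
```

Note that the signature test only confirms the leading bytes; it does not validate the rest of the file, so a truncated or corrupt PDF would still pass.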
docp-0.2.0/docp/loaders/__init__.py ADDED

@@ -0,0 +1,38 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+"""
+:Purpose:   This module provides the project initialisation logic.
+:Platform:  Linux/Windows | Python 3.10+
+:Developer: J Berendt
+:Email:     development@s3dev.uk
+:Comments:  n/a
+"""
+import os
+import sys
+sys.path.insert(0, os.path.dirname(os.path.realpath(__file__)))
+from utils4.user_interface import ui
+# TODO: Change these to use logging.
+# Bring entry-points to the surface.
+try:
+    from .chromapdfloader import ChromaPDFLoader
+except ImportError as err:
+    # The chroma loader requires a lot of backend which is not required for the parser.
+    msg = ( 'An error occurred while importing the Chroma PDF loader:\n'
+           f'- {err}\n'
+            '  - This can be ignored if the loader is not in use.\n')
+    ui.print_warning(f'\n[ImportError]: {msg}')
+try:
+    from .chromapptxloader import ChromaPPTXLoader
+except ImportError as err:
+    # The chroma loader requires a lot of backend which is not required for the parser.
+    msg = ( 'An error occurred while importing the Chroma PPTX loader:\n'
+           f'- {err}\n'
+            '  - This can be ignored if the loader is not in use.\n')
+    ui.print_warning(f'\n[ImportError]: {msg}')
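The README in this diff states that users are told which dependencies are missing when a parser or loader is imported. One way a caller could check this up front, without triggering the import warnings, is sketched below using `importlib.util.find_spec` from the standard library. The function name `missing_dependencies` is hypothetical, and the module names passed in are placeholders chosen so the example behaves the same on any machine.

```python
from importlib.util import find_spec


def missing_dependencies(names: list) -> list:
    """Return the top-level module names that cannot be found.

    Hypothetical helper. Only top-level names are supported here, since
    find_spec may raise for a dotted name whose parent is not installed.
    """
    return [name for name in names if find_spec(name) is None]


# In practice the list might hold names such as 'torch' or 'langchain';
# placeholders are used here so the result does not depend on the environment.
absent = missing_dependencies(['json', 'not_a_real_module_xyz'])
```

`find_spec` locates a module without executing it, so the check stays cheap even for heavy packages.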
