pembot 0.0.3__py2.py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of pembot might be problematic.

Files changed (129)
  1. pembot/.git/COMMIT_EDITMSG +1 -0
  2. pembot/.git/HEAD +1 -0
  3. pembot/.git/config +11 -0
  4. pembot/.git/description +1 -0
  5. pembot/.git/hooks/applypatch-msg.sample +15 -0
  6. pembot/.git/hooks/commit-msg.sample +24 -0
  7. pembot/.git/hooks/fsmonitor-watchman.sample +174 -0
  8. pembot/.git/hooks/post-update.sample +8 -0
  9. pembot/.git/hooks/pre-applypatch.sample +14 -0
  10. pembot/.git/hooks/pre-commit.sample +49 -0
  11. pembot/.git/hooks/pre-merge-commit.sample +13 -0
  12. pembot/.git/hooks/pre-push.sample +53 -0
  13. pembot/.git/hooks/pre-rebase.sample +169 -0
  14. pembot/.git/hooks/pre-receive.sample +24 -0
  15. pembot/.git/hooks/prepare-commit-msg.sample +42 -0
  16. pembot/.git/hooks/push-to-checkout.sample +78 -0
  17. pembot/.git/hooks/sendemail-validate.sample +77 -0
  18. pembot/.git/hooks/update.sample +128 -0
  19. pembot/.git/index +0 -0
  20. pembot/.git/info/exclude +6 -0
  21. pembot/.git/logs/HEAD +6 -0
  22. pembot/.git/logs/refs/heads/main +6 -0
  23. pembot/.git/logs/refs/remotes/origin/HEAD +1 -0
  24. pembot/.git/logs/refs/remotes/origin/main +5 -0
  25. pembot/.git/objects/0a/fb3a98cdc55b1434b44534ec2bf22c56cfa26c +0 -0
  26. pembot/.git/objects/0c/8d9b2690545bf1906b05cd9f18b783b3eb74f1 +0 -0
  27. pembot/.git/objects/18/28e18ab80aa64d334b26428708140e280cbc63 +0 -0
  28. pembot/.git/objects/19/f61df7dbd562d04f561288677bbf2f18f5dff7 +0 -0
  29. pembot/.git/objects/28/db0ab48059acccd7d257aa02e52e9b6b83a4a5 +0 -0
  30. pembot/.git/objects/35/97e518a8658280be9f377f78edf1dfa1f23814 +0 -0
  31. pembot/.git/objects/3d/07d3b29ff53d95de3898fb786d61732f210515 +0 -0
  32. pembot/.git/objects/3e/cf23eb95123287531d708a21d4ba88d92ccabb +0 -0
  33. pembot/.git/objects/3f/78215d7e17da726fb352fd92b3c117db9b63ba +0 -0
  34. pembot/.git/objects/3f/e072cf3cb6a9f30c3e9936e3ddf622e80270d0 +0 -0
  35. pembot/.git/objects/51/9e780574933d7627a083222bd10dd74f430904 +0 -0
  36. pembot/.git/objects/61/46a371b9c1bd9f51af273f11f986cfd1bedeba +0 -0
  37. pembot/.git/objects/64/00040794955d17c9a1fe1aaaea59f2c4822177 +0 -0
  38. pembot/.git/objects/6d/7a865a23b1cb4182f67907820104ced48b11c9 +0 -0
  39. pembot/.git/objects/72/f047cda92abcd1ddc857f6461de605f8668331 +0 -0
  40. pembot/.git/objects/73/2e98f08bc806c331b06847fc8c743f545499e5 +0 -0
  41. pembot/.git/objects/86/cdaec229f1fbebf43042266b03878944669f25 +0 -0
  42. pembot/.git/objects/87/d6df5217a4a374f8c1211a05f9bd657f72c9a7 +0 -0
  43. pembot/.git/objects/8b/5be2af9b16f290549193859c214cd9072212e8 +0 -0
  44. pembot/.git/objects/93/8f29d9b4b1ae86e39dddf9e3d115a82ddfc9b6 +0 -0
  45. pembot/.git/objects/9b/123713e30fc9e225f9ac8ff5b02f8f8cf86456 +0 -0
  46. pembot/.git/objects/ab/c6b15265171457b41e2cfdaf3b8c3994a59eb7 +0 -0
  47. pembot/.git/objects/ac/9c9018c62fa30dc142665c1b5a375f4e056880 +0 -0
  48. pembot/.git/objects/b1/1173d9b68db117437ccb9551461152e1e8a77d +0 -0
  49. pembot/.git/objects/b2/4e79ab07fe9e68781961a25ff9f1dbb1546fbb +0 -0
  50. pembot/.git/objects/b8/eea52176ffa4d88c5a9976bee26092421565d3 +0 -0
  51. pembot/.git/objects/bf/32a7e6872e5dc4025ee3df3c921ec7ade0855f +0 -0
  52. pembot/.git/objects/c0/793458db6e1bee7f79f1a504fb8ff4963f8ed3 +0 -0
  53. pembot/.git/objects/c2/443060c07101948487cfa93cc39e082e9e0f5f +0 -0
  54. pembot/.git/objects/e5/3070f2b07f45d031444b09b1b38658f3caf29e +0 -0
  55. pembot/.git/objects/e7/911a702079a6144997ea4e70f59abbe59ec2bc +0 -0
  56. pembot/.git/objects/e9/1172752e9a421ae463112d2b0506b37498c98d +0 -0
  57. pembot/.git/objects/ea/0af89e61a882c5afc2a8c281b2d96f174bfe58 +0 -0
  58. pembot/.git/objects/eb/75e1c49f1e5b79dca17ccdbec8067756523238 +0 -0
  59. pembot/.git/objects/f1/655afa1c5636c8d58969e3194bb770aefbc552 +0 -0
  60. pembot/.git/objects/f4/e991088a63def67a30a2b8bbdb4d58514abab8 +0 -0
  61. pembot/.git/objects/f8/cbb5bfd1503e66cec2c593362c60a317b6d300 +0 -0
  62. pembot/.git/objects/f9/98e1f01c2bf0a20159fc851327af05beb3ac88 +0 -0
  63. pembot/.git/objects/fa/9c9a62ec1203a5868b033ded428c2382c4e1b6 +0 -0
  64. pembot/.git/objects/fb/6c90c9ce5e0cdfbe074a3f060afc66f62eefde +0 -0
  65. pembot/.git/objects/fc/e56f1e09d09a05b9babf796fb40bece176f3a2 +0 -0
  66. pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.idx +0 -0
  67. pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.pack +0 -0
  68. pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.rev +0 -0
  69. pembot/.git/packed-refs +2 -0
  70. pembot/.git/refs/heads/main +1 -0
  71. pembot/.git/refs/remotes/origin/HEAD +1 -0
  72. pembot/.git/refs/remotes/origin/main +1 -0
  73. pembot/.gitignore +7 -0
  74. pembot/AnyToText/__init__.py +0 -0
  75. pembot/AnyToText/convertor.py +260 -0
  76. pembot/LICENSE +674 -0
  77. pembot/TextEmbedder/__init__.py +0 -0
  78. pembot/TextEmbedder/gemini_embedder.py +27 -0
  79. pembot/TextEmbedder/mongodb_embedder.py +258 -0
  80. pembot/TextEmbedder/mongodb_index_creator.py +133 -0
  81. pembot/TextEmbedder/vector_query.py +64 -0
  82. pembot/__init__.py +6 -0
  83. pembot/config/config.yaml +5 -0
  84. pembot/gartner.py +140 -0
  85. pembot/main.py +208 -0
  86. pembot/output_structure_local.py +63 -0
  87. pembot/pdf2markdown/.git/HEAD +1 -0
  88. pembot/pdf2markdown/.git/config +11 -0
  89. pembot/pdf2markdown/.git/description +1 -0
  90. pembot/pdf2markdown/.git/hooks/applypatch-msg.sample +15 -0
  91. pembot/pdf2markdown/.git/hooks/commit-msg.sample +24 -0
  92. pembot/pdf2markdown/.git/hooks/fsmonitor-watchman.sample +174 -0
  93. pembot/pdf2markdown/.git/hooks/post-update.sample +8 -0
  94. pembot/pdf2markdown/.git/hooks/pre-applypatch.sample +14 -0
  95. pembot/pdf2markdown/.git/hooks/pre-commit.sample +49 -0
  96. pembot/pdf2markdown/.git/hooks/pre-merge-commit.sample +13 -0
  97. pembot/pdf2markdown/.git/hooks/pre-push.sample +53 -0
  98. pembot/pdf2markdown/.git/hooks/pre-rebase.sample +169 -0
  99. pembot/pdf2markdown/.git/hooks/pre-receive.sample +24 -0
  100. pembot/pdf2markdown/.git/hooks/prepare-commit-msg.sample +42 -0
  101. pembot/pdf2markdown/.git/hooks/push-to-checkout.sample +78 -0
  102. pembot/pdf2markdown/.git/hooks/sendemail-validate.sample +77 -0
  103. pembot/pdf2markdown/.git/hooks/update.sample +128 -0
  104. pembot/pdf2markdown/.git/index +0 -0
  105. pembot/pdf2markdown/.git/info/exclude +6 -0
  106. pembot/pdf2markdown/.git/logs/HEAD +1 -0
  107. pembot/pdf2markdown/.git/logs/refs/heads/main +1 -0
  108. pembot/pdf2markdown/.git/logs/refs/remotes/origin/HEAD +1 -0
  109. pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.idx +0 -0
  110. pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.pack +0 -0
  111. pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.rev +0 -0
  112. pembot/pdf2markdown/.git/packed-refs +2 -0
  113. pembot/pdf2markdown/.git/refs/heads/main +1 -0
  114. pembot/pdf2markdown/.git/refs/remotes/origin/HEAD +1 -0
  115. pembot/pdf2markdown/LICENSE +21 -0
  116. pembot/pdf2markdown/README.md +107 -0
  117. pembot/pdf2markdown/__init__.py +0 -0
  118. pembot/pdf2markdown/config/config.yaml +2 -0
  119. pembot/pdf2markdown/extract.py +888 -0
  120. pembot/pdf2markdown/requirements.txt +8 -0
  121. pembot/pem.py +157 -0
  122. pembot/query.py +204 -0
  123. pembot/utils/__init__.py +0 -0
  124. pembot/utils/inference_client.py +132 -0
  125. pembot/utils/string_tools.py +45 -0
  126. pembot-0.0.3.dist-info/METADATA +8 -0
  127. pembot-0.0.3.dist-info/RECORD +129 -0
  128. pembot-0.0.3.dist-info/WHEEL +5 -0
  129. pembot-0.0.3.dist-info/licenses/LICENSE +674 -0
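
The file list above includes two `.git` directories (`pembot/.git/` and `pembot/pdf2markdown/.git/`) shipped inside the wheel. Whether that is why the release is flagged is not stated on this page. As a minimal sketch, assuming you have a wheel on disk and only want to check it for version-control metadata, the standard-library `zipfile` module is enough. The helper name `find_vcs_entries` is hypothetical, and the in-memory archive below only mimics a few paths from the list.

```python
import io
import zipfile


def find_vcs_entries(wheel_file):
    """Return archive member names that sit inside a `.git` directory.

    `wheel_file` may be a path or a binary file-like object; a wheel is a
    plain zip archive, so zipfile can read it directly.
    """
    with zipfile.ZipFile(wheel_file) as archive:
        return [
            name
            for name in archive.namelist()
            # Only directory components count, so `.gitignore` is not matched.
            if ".git" in name.split("/")[:-1]
        ]


# Build a tiny stand-in archive with a few paths taken from the file list.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for member in (
        "pembot/.git/HEAD",
        "pembot/.gitignore",
        "pembot/main.py",
        "pembot/pdf2markdown/.git/config",
    ):
        archive.writestr(member, "")
buffer.seek(0)

flagged = find_vcs_entries(buffer)
print(flagged)
```

For a real file, pass the wheel's path instead of the in-memory buffer.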
@@ -0,0 +1,107 @@
+ # PDF to Markdown Extractor
+
+ ## Table of Contents
+
+ - [Objective](#objective)
+ - [Features](#features)
+ - [Requirements](#requirements)
+ - [Installation](#installation)
+ - [Usage](#usage)
+ - [Performance and Accuracy](#performance-and-accuracy)
+ - [Limitations](#limitations)
+ - [Use in Downstream Tasks](#use-in-downstream-tasks)
+ - [Contributing](#contributing)
+ - [License](#license)
+
+ ## Objective
+
+ This project aims to extract markdown-formatted content from PDF files, specifically designed for downstream tasks such as Retrieval Augmented Generation (RAG). It preserves various markdown elements such as tables, images, links, bold and italic text, blockquotes, code blocks, and other markdown-specific syntax. The script utilizes Python libraries like PyMuPDF (fitz), pdfplumber, pytesseract, and others to achieve accurate extraction and conversion, focusing solely on converting PDF files to Markdown format.
+
+ ## Features
+
+ - Extracts text, images, tables, and code blocks from PDF files
+ - Converts PDF content to markdown format optimized for RAG and other NLP tasks
+ - Preserves formatting for bold, italic, tables, images, links, lists, and code blocks
+ - Handles complex layouts including multi-column text
+ - Performs OCR on images to extract text
+ - Generates image captions using a pre-trained model
+ - Outputs clean, structured markdown suitable for information retrieval and text generation tasks
+
+ ## Requirements
+
+ - Python 3.8+
+ - PyMuPDF (fitz)
+ - pdfplumber
+ - pytesseract
+ - OpenCV (cv2)
+ - numpy
+ - Pillow (PIL)
+ - transformers
+ - torch
+
+ ## Installation
+
+ 1. Clone the repository:
+ ```
+ git clone https://github.com/iamarunbrahma/pdf-to-markdown.git
+ cd pdf-to-markdown
+ ```
+
+ 2. Create a virtual environment (optional but recommended):
+ ```
+ python -m venv venv
+ source venv/bin/activate
+ ```
+
+ 3. Install the required packages:
+ ```
+ pip install -r requirements.txt
+ ```
+
+ 4. Install Tesseract OCR:
+ - On Ubuntu: `sudo apt-get install tesseract-ocr`
+ - On macOS: `brew install tesseract`
+ - On Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
+
+ ## Usage
+
+ Run the script with the path to your PDF file as an argument:
+
+ ```
+ python extract.py --pdf_path path/to/your/file.pdf
+ ```
+
+ The extracted markdown content will be saved in the `outputs` directory with the same name as the input PDF file, but with a `.md` extension.
+
+ ## Performance and Accuracy
+
+ The script is designed to handle various PDF layouts and content types, with a focus on producing high-quality markdown for downstream NLP tasks:
+
+ - **Accuracy**: The extractor aims for high accuracy in preserving the original document's structure and formatting. It handles common elements like text, tables, images, links, and code blocks well, ensuring the output is suitable for tasks like RAG. However, very complex layouts or PDFs with non-standard formatting might require manual review.
+
+ - **Speed**: The processing time depends on the PDF's size and complexity. On average, for a 10-page PDF with mixed content (text, images, tables, and code blocks), the extraction process typically takes about 30-60 seconds on a modern computer.
+
+ - **Optimization for RAG**: The output is structured to facilitate easy parsing and chunking for RAG systems, with clear delineation between different sections and content types.
+
+ ## Limitations
+
+ - This tool is specifically designed for PDF to Markdown conversion and does not handle other file formats.
+ - Very large PDFs (100+ pages) may require significant processing time.
+ - PDFs with complex mathematical formulas or specialized symbols may not be perfectly converted.
+ - Scanned PDFs without embedded text will rely on OCR, which may not be 100% accurate.
+
+ ## Use in Downstream Tasks
+
+ The markdown output from this extractor is particularly well-suited for:
+
+ 1. **Retrieval Augmented Generation (RAG)**: The structured markdown can be easily indexed and retrieved, providing context for language models in RAG systems.
+ 2. **Text Summarization**: Clean, well-formatted markdown facilitates more accurate summarization of document content.
+ 3. **Information Extraction**: The preserved structure aids in extracting specific information from documents.
+
+ ## Contributing
+
+ Contributions to improve the extractor's accuracy, speed, or feature set are welcome, especially those that enhance its utility for RAG and other NLP tasks. Please feel free to submit issues or pull requests.
+
+ ## License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
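
The Usage section of the README above says the result is written to the `outputs` directory under the input file's name with a `.md` extension. A minimal sketch of that naming rule follows. The helper `markdown_output_path` is hypothetical and is not taken from `extract.py`, whose contents are not shown here.

```python
from pathlib import Path


def markdown_output_path(pdf_path, output_dir="outputs"):
    """Map an input PDF path to the Markdown path the README describes.

    Hypothetical helper: keeps the file's base name, swaps the extension
    for `.md`, and places the result under `output_dir`.
    """
    return Path(output_dir) / (Path(pdf_path).stem + ".md")


target = markdown_output_path("path/to/your/file.pdf")
print(target)
```

Only the final extension is replaced, so a name such as `report.v2.pdf` would map to `report.v2.md` under this sketch.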
File without changes
@@ -0,0 +1,2 @@
+ PAGE_DELIMITER: <||WXb23TXrUn3Rxz00yNNr89HV||>
+ OUTPUT_DIR: outputs
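
The two keys above look like a page separator token and an output directory. How `extract.py` actually consumes them is not shown in this diff, so the following is only a sketch under that assumption: it reads the flat `key: value` pairs with a small standard-library parser (a stand-in for a real YAML loader) and splits a sample string on the delimiter. The names `parse_flat_config` and `split_pages` are hypothetical.

```python
CONFIG_TEXT = """\
PAGE_DELIMITER: <||WXb23TXrUn3Rxz00yNNr89HV||>
OUTPUT_DIR: outputs
"""


def parse_flat_config(text):
    """Parse flat `key: value` lines; not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


def split_pages(markdown, delimiter):
    """Split text on the delimiter and drop empty chunks (assumed usage)."""
    return [page.strip() for page in markdown.split(delimiter) if page.strip()]


config = parse_flat_config(CONFIG_TEXT)
sample = f"# Page one\n{config['PAGE_DELIMITER']}\n# Page two\n"
pages = split_pages(sample, config["PAGE_DELIMITER"])
print(pages)
```

A long random-looking token like this one is unlikely to collide with text in a converted document, which is presumably why it was chosen over a short marker.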