pembot 0.0.5__py2.py3-none-any.whl → 0.0.6__py2.py3-none-any.whl

Potentially problematic release: this version of pembot might be problematic.

Files changed (42)
  1. pembot/.git/COMMIT_EDITMSG +1 -1
  2. pembot/.git/index +0 -0
  3. pembot/.git/logs/HEAD +1 -0
  4. pembot/.git/logs/refs/heads/main +1 -0
  5. pembot/.git/logs/refs/remotes/origin/main +1 -0
  6. pembot/.git/objects/3e/23850624fcf5f111d6ea88ddd64adf924cf82f +0 -0
  7. pembot/.git/objects/4d/a03134f70896f72053fbdc0cd4f4c76d4ac1d8 +0 -0
  8. pembot/.git/objects/95/28bbccd167e3f4ad583a1ae9fac98a52620e27 +0 -0
  9. pembot/.git/objects/bd/8fd1cb166996e74a8631f3a6f764a53af75297 +0 -0
  10. pembot/.git/objects/bf/518686b06069d2a8abd3689908b7e1a6e16b05 +0 -0
  11. pembot/.git/objects/e0/9162dbd64d85bb5ed740aa99faefa73f293d78 +0 -0
  12. pembot/.git/refs/heads/main +1 -1
  13. pembot/.git/refs/remotes/origin/main +1 -1
  14. pembot/AnyToText/convertor.py +8 -6
  15. pembot/__init__.py +1 -1
  16. pembot/config/config.yaml +1 -1
  17. pembot/pdf2markdown/.git/COMMIT_EDITMSG +1 -0
  18. pembot/pdf2markdown/.git/config +3 -0
  19. pembot/pdf2markdown/.git/index +0 -0
  20. pembot/pdf2markdown/.git/logs/HEAD +3 -0
  21. pembot/pdf2markdown/.git/logs/refs/heads/main +3 -0
  22. pembot/pdf2markdown/.git/logs/refs/remotes/myorigin/main +3 -0
  23. pembot/pdf2markdown/.git/objects/14/251b198e0bac39a3dc3b42f9e57b20c01465fb +0 -0
  24. pembot/pdf2markdown/.git/objects/24/8f03b5f969a7fbd396b496f40b57f0ae81c148 +0 -0
  25. pembot/pdf2markdown/.git/objects/57/74dc9c3901d2ffb2cd7dafe2ad6612a7f9f42c +0 -0
  26. pembot/pdf2markdown/.git/objects/72/2dc14f82e78ce41717348b256e0c17834933b4 +0 -0
  27. pembot/pdf2markdown/.git/objects/79/eb7b93ced70e399bd561093c45de7641414dbd +0 -0
  28. pembot/pdf2markdown/.git/objects/8d/9ce1fd9733a78c592b34af9c94b98960c601ed +0 -0
  29. pembot/pdf2markdown/.git/objects/95/745843bb4377d6042180daeda818c0b16fd493 +0 -0
  30. pembot/pdf2markdown/.git/objects/a5/c6dfb577782c259990dcf977e355298e923428 +0 -0
  31. pembot/pdf2markdown/.git/objects/b4/8d697aa9fd97151eb2a84a1af5d408b7630232 +0 -0
  32. pembot/pdf2markdown/.git/objects/b8/702320e56074e9680181d8b7897d6a0a552e2d +0 -0
  33. pembot/pdf2markdown/.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 +0 -0
  34. pembot/pdf2markdown/.git/refs/heads/main +1 -1
  35. pembot/pdf2markdown/.git/refs/remotes/myorigin/main +1 -0
  36. pembot/pdf2markdown/extract.py +58 -90
  37. pembot/pdf2markdown/pyrightconfig.json +4 -0
  38. pembot/requirements.txt +80 -0
  39. {pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/METADATA +1 -1
  40. {pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/RECORD +42 -20
  41. {pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/WHEEL +0 -0
  42. {pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/licenses/LICENSE +0 -0
pembot/.git/COMMIT_EDITMSG CHANGED
@@ -1 +1 @@
1
- fixed the output_dir bug; fixed the excel to json function; ran some tests on convertor; incremented the version on the package; removed dependency on schema / structure, and shifted required fields to a pickle file path in the cli args;
1
+ handled config loading errors gracefully; added gemini support, as an option; added huggingface nanonets transformers support (as an option); redesigned the extract markdown for captioning and image ocr (block image and full-page image);
pembot/.git/index CHANGED
Binary file
pembot/.git/logs/HEAD CHANGED
@@ -5,3 +5,4 @@ ac9c9018c62fa30dc142665c1b5a375f4e056880 72f047cda92abcd1ddc857f6461de605f866833
5
5
  e91172752e9a421ae463112d2b0506b37498c98d 0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 cyto <silverstone965@gmail.com> 1749716350 +0530 commit: added a pem blog chunking module for updating from local, and, an embedding loop to embed all the blogs, with document id as the filter in the search, and the first line title as the filter in updation
6
6
  0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 eb75e1c49f1e5b79dca17ccdbec8067756523238 cyto <silverstone965@gmail.com> 1750856653 +0530 commit: made arrangements for the cases when custom file bytes are to be processed to text output; handled a ollama running / crashing error
7
7
  eb75e1c49f1e5b79dca17ccdbec8067756523238 0bdb4169fc0f312b8698f1df17a258fff163aeaa cyto <silverstone965@gmail.com> 1750937276 +0530 commit: fixed the output_dir bug; fixed the excel to json function; ran some tests on convertor; incremented the version on the package; removed dependency on schema / structure, and shifted required fields to a pickle file path in the cli args;
8
+ 0bdb4169fc0f312b8698f1df17a258fff163aeaa 9528bbccd167e3f4ad583a1ae9fac98a52620e27 cyto <silverstone965@gmail.com> 1750947488 +0530 commit: handled local llm nonexistent error properly for choice of just passing None as llm_client;
pembot/.git/logs/refs/heads/main CHANGED
@@ -5,3 +5,4 @@ ac9c9018c62fa30dc142665c1b5a375f4e056880 72f047cda92abcd1ddc857f6461de605f866833
5
5
  e91172752e9a421ae463112d2b0506b37498c98d 0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 cyto <silverstone965@gmail.com> 1749716350 +0530 commit: added a pem blog chunking module for updating from local, and, an embedding loop to embed all the blogs, with document id as the filter in the search, and the first line title as the filter in updation
6
6
  0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 eb75e1c49f1e5b79dca17ccdbec8067756523238 cyto <silverstone965@gmail.com> 1750856653 +0530 commit: made arrangements for the cases when custom file bytes are to be processed to text output; handled a ollama running / crashing error
7
7
  eb75e1c49f1e5b79dca17ccdbec8067756523238 0bdb4169fc0f312b8698f1df17a258fff163aeaa cyto <silverstone965@gmail.com> 1750937276 +0530 commit: fixed the output_dir bug; fixed the excel to json function; ran some tests on convertor; incremented the version on the package; removed dependency on schema / structure, and shifted required fields to a pickle file path in the cli args;
8
+ 0bdb4169fc0f312b8698f1df17a258fff163aeaa 9528bbccd167e3f4ad583a1ae9fac98a52620e27 cyto <silverstone965@gmail.com> 1750947488 +0530 commit: handled local llm nonexistent error properly for choice of just passing None as llm_client;
pembot/.git/logs/refs/remotes/origin/main CHANGED
@@ -4,3 +4,4 @@ ac9c9018c62fa30dc142665c1b5a375f4e056880 72f047cda92abcd1ddc857f6461de605f866833
4
4
  e91172752e9a421ae463112d2b0506b37498c98d 0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 cyto <silverstone965@gmail.com> 1749716371 +0530 update by push
5
5
  0c8d9b2690545bf1906b05cd9f18b783b3eb74f1 eb75e1c49f1e5b79dca17ccdbec8067756523238 cyto <silverstone965@gmail.com> 1750856672 +0530 update by push
6
6
  eb75e1c49f1e5b79dca17ccdbec8067756523238 0bdb4169fc0f312b8698f1df17a258fff163aeaa cyto <silverstone965@gmail.com> 1750937389 +0530 update by push
7
+ 0bdb4169fc0f312b8698f1df17a258fff163aeaa 9528bbccd167e3f4ad583a1ae9fac98a52620e27 cyto <silverstone965@gmail.com> 1750947502 +0530 update by push
pembot/.git/refs/heads/main CHANGED
@@ -1 +1 @@
1
- 0bdb4169fc0f312b8698f1df17a258fff163aeaa
1
+ 9528bbccd167e3f4ad583a1ae9fac98a52620e27
pembot/.git/refs/remotes/origin/main CHANGED
@@ -1 +1 @@
1
- 0bdb4169fc0f312b8698f1df17a258fff163aeaa
1
+ 9528bbccd167e3f4ad583a1ae9fac98a52620e27
pembot/AnyToText/convertor.py CHANGED
@@ -35,6 +35,8 @@ class Convertor():
35
35
 
36
36
  self.output= ""
37
37
 
38
+ # model_name= "gemini-2.5-flash"
39
+ model_name= None
38
40
  # file_type can be pdf, excel, etc.
39
41
  if output_dir is None and myfile is None and file_bytes is not None and suffix is not None:
40
42
  with tempfile.TemporaryDirectory() as dp:
@@ -43,7 +45,7 @@ class Convertor():
43
45
  myfile= Path(fp.name)
44
46
  output_dir= Path(dp)
45
47
  if file_type == 'pdf':
46
- extractor= MarkdownPDFExtractor(str(myfile), output_path= str(output_dir), page_delimiter= "-- NEXT PAGE --")
48
+ extractor= MarkdownPDFExtractor(str(myfile), output_path= str(output_dir), page_delimiter= "-- NEXT PAGE --", model_name= model_name)
47
49
  extractor.extract()
48
50
  with open(output_dir / (myfile.stem + '.md')) as output_file:
49
51
  self.output= output_file.read()
@@ -67,7 +69,7 @@ class Convertor():
67
69
  print("the file was json")
68
70
  elif mt == 'application/pdf':
69
71
  print("the file was pdf, outputting in: ", output_dir)
70
- extractor= MarkdownPDFExtractor(str(myfile), output_path= str(self.output_dir), page_delimiter= "-- NEXT PAGE --")
72
+ extractor= MarkdownPDFExtractor(str(myfile), output_path= str(self.output_dir), page_delimiter= "-- NEXT PAGE --", model_name= model_name)
71
73
  extractor.extract()
72
74
 
73
75
  elif mt in EXCEL_FILE_TYPES:
@@ -333,10 +335,10 @@ def chunk_text(text, chunk_size=500, overlap_size=50):
333
335
  if __name__ == '__main__':
334
336
  print("Test Run Start:")
335
337
  try:
336
- # print("Test 1: scaned pdf page, bytes")
337
- # with open("/home/cyto/Documents/scanned.pdf", "rb") as imgpdf:
338
- # conv= Convertor(file_bytes= imgpdf.read(), suffix= ".pdf", file_type= "pdf")
339
- # print(conv.output)
338
+ print("Test 1: scaned pdf page, bytes")
339
+ with open("/home/cyto/Documents/scanned.pdf", "rb") as imgpdf:
340
+ conv= Convertor(file_bytes= imgpdf.read(), suffix= ".pdf", file_type= "pdf")
341
+ print(conv.output)
340
342
 
341
343
  # print("Test 2: JD pdf, bytes")
342
344
  # with open("/home/cyto/dev/pembotdir/jds/PM Trainee.pdf", "rb") as imgpdf:
pembot/__init__.py CHANGED
@@ -1,6 +1,6 @@
1
1
  """
2
2
  A Python Package to convert PEM blog content to usseful information by leveraging LLMs
3
3
  """
4
- __version__ = '0.0.5'
4
+ __version__ = '0.0.6'
5
5
  from .main import save_to_json_file, make_query
6
6
  __all__ = ["save_to_json_file", "make_query"]
pembot/config/config.yaml CHANGED
@@ -2,4 +2,4 @@ OUTPUT_DIR: /home/cyto/dev/pembotdir
2
2
  PAGE_DELIMITER: ___________________________ NEXT PAGE ___________________________
3
3
  app:
4
4
  name: pembot
5
- version: 0.0.5
5
+ version: 0.0.6
pembot/pdf2markdown/.git/COMMIT_EDITMSG ADDED
@@ -0,0 +1 @@
1
+ cyto/argument-list-bug-fix;authentication-used-in-gradio-client
pembot/pdf2markdown/.git/config CHANGED
@@ -9,3 +9,6 @@
9
9
  [branch "main"]
10
10
  remote = origin
11
11
  merge = refs/heads/main
12
+ [remote "myorigin"]
13
+ url = https://github.com/silverstone-git/pdf-to-markdown.git
14
+ fetch = +refs/heads/*:refs/remotes/myorigin/*
pembot/pdf2markdown/.git/index CHANGED
Binary file
pembot/pdf2markdown/.git/logs/HEAD CHANGED
@@ -1 +1,4 @@
1
1
  0000000000000000000000000000000000000000 ffb759ee4605b232366a9ee58134532913c3f9e0 cyto <cyto@callisto.localdomain> 1747745478 +0530 clone: from https://github.com/iamarunbrahma/pdf-to-markdown
2
+ ffb759ee4605b232366a9ee58134532913c3f9e0 b8702320e56074e9680181d8b7897d6a0a552e2d cyto <silverstone965@gmail.com> 1750947962 +0530 commit: handled config loading errors gracefully; added gemini support, as an option; added huggingface nanonets transformers support (as an option); redesigned the extract markdown for captioning and image ocr (block image and full-page image);
3
+ b8702320e56074e9680181d8b7897d6a0a552e2d 14251b198e0bac39a3dc3b42f9e57b20c01465fb cyto <silverstone965@gmail.com> 1751604763 +0530 commit: removed deps on torch and transformers; used gradio client for ocr through public spaces;
4
+ 14251b198e0bac39a3dc3b42f9e57b20c01465fb b48d697aa9fd97151eb2a84a1af5d408b7630232 cyto <silverstone965@gmail.com> 1751871887 +0530 commit: cyto/argument-list-bug-fix;authentication-used-in-gradio-client
pembot/pdf2markdown/.git/logs/refs/heads/main CHANGED
@@ -1 +1,4 @@
1
1
  0000000000000000000000000000000000000000 ffb759ee4605b232366a9ee58134532913c3f9e0 cyto <cyto@callisto.localdomain> 1747745478 +0530 clone: from https://github.com/iamarunbrahma/pdf-to-markdown
2
+ ffb759ee4605b232366a9ee58134532913c3f9e0 b8702320e56074e9680181d8b7897d6a0a552e2d cyto <silverstone965@gmail.com> 1750947962 +0530 commit: handled config loading errors gracefully; added gemini support, as an option; added huggingface nanonets transformers support (as an option); redesigned the extract markdown for captioning and image ocr (block image and full-page image);
3
+ b8702320e56074e9680181d8b7897d6a0a552e2d 14251b198e0bac39a3dc3b42f9e57b20c01465fb cyto <silverstone965@gmail.com> 1751604763 +0530 commit: removed deps on torch and transformers; used gradio client for ocr through public spaces;
4
+ 14251b198e0bac39a3dc3b42f9e57b20c01465fb b48d697aa9fd97151eb2a84a1af5d408b7630232 cyto <silverstone965@gmail.com> 1751871887 +0530 commit: cyto/argument-list-bug-fix;authentication-used-in-gradio-client
pembot/pdf2markdown/.git/logs/refs/remotes/myorigin/main ADDED
@@ -0,0 +1,3 @@
1
+ 0000000000000000000000000000000000000000 b8702320e56074e9680181d8b7897d6a0a552e2d cyto <silverstone965@gmail.com> 1750948073 +0530 update by push
2
+ b8702320e56074e9680181d8b7897d6a0a552e2d 14251b198e0bac39a3dc3b42f9e57b20c01465fb cyto <silverstone965@gmail.com> 1751604904 +0530 update by push
3
+ 14251b198e0bac39a3dc3b42f9e57b20c01465fb b48d697aa9fd97151eb2a84a1af5d408b7630232 cyto <silverstone965@gmail.com> 1751872077 +0530 update by push
pembot/pdf2markdown/.git/refs/heads/main CHANGED
@@ -1 +1 @@
1
- ffb759ee4605b232366a9ee58134532913c3f9e0
1
+ b48d697aa9fd97151eb2a84a1af5d408b7630232
pembot/pdf2markdown/.git/refs/remotes/myorigin/main ADDED
@@ -0,0 +1 @@
1
+ b48d697aa9fd97151eb2a84a1af5d408b7630232
pembot/pdf2markdown/extract.py CHANGED
@@ -2,11 +2,9 @@ import fitz
2
2
  import pdfplumber
3
3
  import re
4
4
  import yaml
5
- # import pytesseract
5
+ import pytesseract
6
6
  import numpy as np
7
- from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText, VisionEncoderDecoderModel, ViTImageProcessor
8
7
  from typing import Literal, final
9
- import torch
10
8
  from PIL import Image
11
9
  import os
12
10
  import logging
@@ -19,6 +17,9 @@ import io
19
17
  from google import genai
20
18
  from google.genai import types
21
19
  import mimetypes
20
+ from gradio_client import Client, handle_file
21
+ import gradio as gr
22
+ import tempfile
22
23
 
23
24
 
24
25
 
@@ -75,25 +76,18 @@ class MarkdownPDFExtractor(PDFExtractor):
75
76
  super().__init__(pdf_path)
76
77
 
77
78
  if model_name is None:
78
- self.MODEL_NAME= "gemini-2.5-flash"
79
+ # self.MODEL_NAME= "gemini-2.5-flash"
80
+ self.MODEL_NAME= "Nanonets-OCR-s"
79
81
  else:
80
82
  self.MODEL_NAME= model_name
81
83
 
82
84
  if "gemini" in self.MODEL_NAME:
83
85
  self.gclient = genai.Client(api_key= os.getenv("GEMINI_API_KEY", ''))
84
- else:
85
- model_path = "nanonets/Nanonets-OCR-s"
86
- self.model = AutoModelForImageTextToText.from_pretrained(
87
- model_path,
88
- torch_dtype="auto",
89
- device_map="auto",
90
- attn_implementation="flash_attention_2"
91
- )
92
- self.model.eval()
93
- self.tokenizer = AutoTokenizer.from_pretrained(model_path)
94
- self.processor = AutoProcessor.from_pretrained(model_path)
95
- self.setup_image_captioning()
86
+ elif "anonet" in self.MODEL_NAME:
87
+ # self.nclient= Client("prithivMLmods/Multimodal-OCR2")
96
88
 
89
+ # zerogpu public
90
+ self.nclient= Client("deepak-mehta/ocr-simplify", hf_token= os.getenv('HF_TOKEN', ''))
97
91
 
98
92
 
99
93
  self.markdown_content= ""
@@ -108,25 +102,6 @@ class MarkdownPDFExtractor(PDFExtractor):
108
102
 
109
103
 
110
104
 
111
- def setup_image_captioning(self):
112
- """Set up the image captioning model."""
113
- try:
114
- self.model = VisionEncoderDecoderModel.from_pretrained(
115
- "nlpconnect/vit-gpt2-image-captioning"
116
- )
117
- self.feature_extractor = ViTImageProcessor.from_pretrained(
118
- "nlpconnect/vit-gpt2-image-captioning"
119
- )
120
- self.tokenizer = AutoTokenizer.from_pretrained(
121
- "nlpconnect/vit-gpt2-image-captioning"
122
- )
123
- self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
124
- self.model.to(self.device)
125
- self.logger.info("Image captioning model set up successfully.")
126
- except Exception as e:
127
- self.logger.error(f"Error setting up image captioning model: {e}")
128
- self.logger.exception(traceback.format_exc())
129
-
130
105
  def extract(self):
131
106
  try:
132
107
  markdown_content, markdown_pages = self.extract_markdown()
@@ -143,12 +118,18 @@ class MarkdownPDFExtractor(PDFExtractor):
143
118
  return "", []
144
119
 
145
120
 
146
- def ocr_page_with_nanonets_s(self, pil_image, img_bytes, max_new_tokens: int | None = None):
147
- prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
121
+ def image_ocr(self, pil_image, img_bytes, max_new_tokens: int | None = None, prompt: str | None= None):
122
+ if prompt is None:
123
+ prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
148
124
  if max_new_tokens is None:
149
125
  max_new_tokens= 4096
150
126
 
151
- if 'gemini' in self.MODEL_NAME:
127
+ w, h= pil_image.size
128
+ if w < 200 or h < 50:
129
+ return "<img> A small image </img>"
130
+
131
+ model_name= self.MODEL_NAME.lower()
132
+ if 'gemini' in model_name:
152
133
 
153
134
  image_format = pil_image.format
154
135
  dummy_filename = f"dummy.{image_format.lower()}"
@@ -165,24 +146,40 @@ class MarkdownPDFExtractor(PDFExtractor):
165
146
  )
166
147
  # print("response :", response)
167
148
  return response.text
168
- else:
169
- image = pil_image
170
- messages = [
171
- {"role": "system", "content": "You are a helpful assistant."},
172
- {"role": "user", "content": [
173
- {"type": "image", "image": image},
174
- {"type": "text", "text": prompt},
175
- ]},
176
- ]
177
- text = self.processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
178
- inputs = self.processor(text=[text], images=[image], padding=True, return_tensors="pt")
179
- inputs = inputs.to(self.model.device)
149
+ elif 'nanonet' in model_name:
180
150
 
181
- output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
182
- generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
183
-
184
- output_text = self.processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
185
- return output_text[0]
151
+ result= ""
152
+ try:
153
+ with tempfile.NamedTemporaryFile(suffix=f'.{pil_image.format.lower()}', mode= 'w') as temp_file:
154
+ pil_image.save(temp_file.name)
155
+ print("file name: ", temp_file.name)
156
+ gr_image= handle_file(temp_file.name)
157
+ print("gr image : ", gr_image)
158
+ result = self.nclient.predict(
159
+ # model_name="Nanonets-OCR-s",
160
+ # text= prompt,
161
+ gr_image,
162
+ # max_new_tokens=max_new_tokens,
163
+ # temperature=0.6,
164
+ # top_p=0.9,
165
+ # top_k=50,
166
+ # repetition_penalty=1.2,
167
+
168
+ # prithiv model
169
+ # api_name="/generate_image"
170
+
171
+ max_new_tokens,
172
+
173
+ # spaces zerogpu
174
+ api_name="/predict"
175
+ )
176
+ print("ocr'd: ", result[:100] + "...")
177
+ except Exception as e:
178
+ print("Error during nanonet inference", e)
179
+
180
+ return result
181
+ else:
182
+ return pytesseract.image_to_string(pil_image)
186
183
 
187
184
 
188
185
 
@@ -219,7 +216,7 @@ class MarkdownPDFExtractor(PDFExtractor):
219
216
  for page_num, page in enumerate(doc):
220
217
  current_page_markdown_blocks = [] # Collect markdown blocks for the current page
221
218
  page_has_searchable_text = False
222
- page_has_embedded_images = False
219
+ # page_has_embedded_images = False
223
220
 
224
221
  self.logger.info(f"\nProcessing page {page_num + 1}...")
225
222
 
@@ -252,7 +249,7 @@ class MarkdownPDFExtractor(PDFExtractor):
252
249
  try:
253
250
  image_bytes= io.BytesIO(img_data)
254
251
  pil_image = Image.open(image_bytes)
255
- ocr_text_from_block_image = self.ocr_page_with_nanonets_s(
252
+ ocr_text_from_block_image = self.image_ocr(
256
253
  pil_image, image_bytes, max_new_tokens=15000
257
254
  )
258
255
 
@@ -292,7 +289,7 @@ class MarkdownPDFExtractor(PDFExtractor):
292
289
  image_bytestream= io.BytesIO(img_bytes)
293
290
  pil_image = Image.open(image_bytestream)
294
291
 
295
- ocr_text_from_page = self.ocr_page_with_nanonets_s(
292
+ ocr_text_from_page = self.image_ocr(
296
293
  pil_image, image_bytestream, max_new_tokens=15000
297
294
  )
298
295
 
@@ -389,7 +386,7 @@ class MarkdownPDFExtractor(PDFExtractor):
389
386
  # ocr_result = pytesseract.image_to_string(
390
387
  # image
391
388
  # )
392
- ocr_result= self.ocr_page_with_nanonets_s(image, image_bytes, max_new_tokens=15000)
389
+ ocr_result= self.image_ocr(image, image_bytes, max_new_tokens=15000)
393
390
 
394
391
 
395
392
  return ocr_result.strip()
@@ -409,38 +406,9 @@ class MarkdownPDFExtractor(PDFExtractor):
409
406
  if image.mode != "RGB":
410
407
  image = image.convert("RGB")
411
408
 
412
- image_format = image.format
413
- dummy_filename = f"dummy.{image_format.lower()}"
414
- mime_type, _ = mimetypes.guess_type(dummy_filename)
415
-
416
- if "gemini" in self.MODEL_NAME:
417
- response= self.gclient.models.generate_content(
418
- model= self.MODEL_NAME,
419
- contents=[
420
- types.Part.from_bytes(
421
- data=image_bytes.getvalue(),
422
- mime_type= mime_type
423
- ),
424
- "Write a caption for this image"
425
- ]
426
- )
427
- return response.text
428
- else:
429
- # Ensure the image is in the correct shape
430
- image = np.array(image).transpose(2, 0, 1) # Convert to (C, H, W) format
431
-
432
- inputs = self.feature_extractor(images=image, return_tensors="pt").to(
433
- self.device
434
- )
435
- pixel_values = inputs.pixel_values
436
-
437
- generated_ids = self.model.generate(pixel_values, max_length=30)
409
+ caption= self.image_ocr(image, image_bytes, max_new_tokens=15000, prompt= "Write a caption for this image")
410
+ return caption
438
411
 
439
- generated_ids = self.model.generate(pixel_values, max_length=30)
440
- generated_caption = self.tokenizer.batch_decode(
441
- generated_ids, skip_special_tokens=True
442
- )[0]
443
- return generated_caption.strip()
444
412
  except Exception as e:
445
413
  self.logger.error(f"Error captioning image: {e}")
446
414
  self.logger.exception(traceback.format_exc())
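
Reviewer note on the extract.py changes above: the new `image_ocr` method returns a placeholder for small images, then picks an OCR backend by substring match on the lowercased model name, falling back to pytesseract. The sketch below isolates that decision logic so it can be checked without any network client. The names `pick_backend` and `is_too_small` are hypothetical helpers introduced here for illustration and do not exist in the package; only the thresholds and substrings come from the diff. One thing it may help surface: the constructor in the diff matches `"gemini"` and `"anonet"` against the un-lowercased name, while `image_ocr` lowercases first, so a mixed-case name such as `"Gemini-..."` would apparently reach the Gemini branch without `self.gclient` having been created.

```python
from typing import Literal

Backend = Literal["gemini", "nanonets", "tesseract"]

# Thresholds copied from the diff: images narrower than 200 px or
# shorter than 50 px are answered with a placeholder, not OCR'd.
MIN_WIDTH = 200
MIN_HEIGHT = 50


def is_too_small(width: int, height: int) -> bool:
    """Mirror the `w < 200 or h < 50` guard added to image_ocr."""
    return width < MIN_WIDTH or height < MIN_HEIGHT


def pick_backend(model_name: str) -> Backend:
    """Hypothetical helper: choose a backend by case-insensitive substring.

    Lowercasing once, as image_ocr does, keeps the match consistent;
    using this same helper in the constructor is an assumption about
    how the two checks could be aligned, not the package's own code.
    """
    name = model_name.lower()
    if "gemini" in name:
        return "gemini"
    if "nanonet" in name:
        return "nanonets"
    # Anything else falls through to local pytesseract OCR in the diff.
    return "tesseract"
```

Sharing one helper between the constructor and `image_ocr` would be one way to keep client setup and dispatch in agreement; whether that fits the author's intent is not something the diff confirms.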
pembot/pdf2markdown/pyrightconfig.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "venvPath": "../..",
3
+ "venv": "venvpem"
4
+ }
pembot/requirements.txt ADDED
@@ -0,0 +1,80 @@
1
+ aiofiles==24.1.0
2
+ annotated-types==0.7.0
3
+ anyio==4.9.0
4
+ audioop-lts==0.2.1
5
+ cachetools==5.5.2
6
+ certifi==2025.6.15
7
+ cffi==1.17.1
8
+ charset-normalizer==3.4.2
9
+ click==8.2.1
10
+ cryptography==45.0.5
11
+ dnspython==2.7.0
12
+ et_xmlfile==2.0.0
13
+ fastapi==0.115.14
14
+ ffmpy==0.6.0
15
+ filelock==3.18.0
16
+ fsspec==2025.5.1
17
+ google-auth==2.40.3
18
+ google-genai==1.24.0
19
+ gradio==5.35.0
20
+ gradio_client==1.10.4
21
+ greenlet==3.2.3
22
+ groovy==0.1.2
23
+ h11==0.16.0
24
+ hf-xet==1.1.5
25
+ httpcore==1.0.9
26
+ httpx==0.28.1
27
+ huggingface-hub==0.33.2
28
+ idna==3.10
29
+ Jinja2==3.1.6
30
+ markdown-it-py==3.0.0
31
+ MarkupSafe==3.0.2
32
+ mdurl==0.1.2
33
+ msgpack==1.1.1
34
+ numpy==2.3.1
35
+ ollama==0.5.1
36
+ openpyxl==3.1.5
37
+ orjson==3.10.18
38
+ packaging==25.0
39
+ pandas==2.3.0
40
+ pathlib==1.0.1
41
+ pdfminer.six==20250506
42
+ pdfplumber==0.11.7
43
+ pembot==0.0.6
44
+ pillow==11.3.0
45
+ pyasn1==0.6.1
46
+ pyasn1_modules==0.4.2
47
+ pycparser==2.22
48
+ pydantic==2.11.7
49
+ pydantic_core==2.33.2
50
+ pydub==0.25.1
51
+ Pygments==2.19.2
52
+ pymongo==4.13.2
53
+ PyMuPDF==1.26.3
54
+ pynvim==0.5.2
55
+ pypdfium2==4.30.1
56
+ pytesseract==0.3.13
57
+ python-dateutil==2.9.0.post0
58
+ python-multipart==0.0.20
59
+ pytz==2025.2
60
+ PyYAML==6.0.2
61
+ requests==2.32.4
62
+ rich==14.0.0
63
+ rsa==4.9.1
64
+ ruff==0.12.1
65
+ safehttpx==0.1.6
66
+ semantic-version==2.10.0
67
+ shellingham==1.5.4
68
+ six==1.17.0
69
+ sniffio==1.3.1
70
+ starlette==0.46.2
71
+ tenacity==8.5.0
72
+ tomlkit==0.13.3
73
+ tqdm==4.67.1
74
+ typer==0.16.0
75
+ typing-inspection==0.4.1
76
+ typing_extensions==4.14.0
77
+ tzdata==2025.2
78
+ urllib3==2.5.0
79
+ uvicorn==0.35.0
80
+ websockets==15.0.1
{pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/METADATA CHANGED
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.4
2
2
  Name: pembot
3
- Version: 0.0.5
3
+ Version: 0.0.6
4
4
  Summary: A Python Package to convert PEM blog content to usseful information by leveraging LLMs
5
5
  Author-email: cyto <aryan_sidhwani@protonmail.com>
6
6
  License-Expression: MIT
{pembot-0.0.5.dist-info → pembot-0.0.6.dist-info}/RECORD CHANGED
@@ -1,16 +1,17 @@
1
1
  pembot/.gitignore,sha256=_7FTsZokJ_pzEyyPjOsGw5x5Xx3gUBFaafs7UlPsv9E,98
2
2
  pembot/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
3
- pembot/__init__.py,sha256=8YYLSe42l8VZcVXDH-vxQKFUGbki1Q8JbnkindAmEMs,211
3
+ pembot/__init__.py,sha256=s4fd-1t1D43kkQi_78FmX_7hi-NsBfqtc2BHwNrMHtw,211
4
4
  pembot/gartner.py,sha256=3ALknQ5mSXIimmwCa3JFDzB_EW2hHEcQO1T2odyBquk,5408
5
5
  pembot/main.py,sha256=lZLIV8XPonvNoY4LVS-5fct1y9URMXWoSGJUKMw3Yg8,9667
6
6
  pembot/output_structure_local.py,sha256=YfpHzfTNeLMSsB_CjAamha9D6Iz7E1IC-tW9xPCMWFc,3000
7
7
  pembot/pem.py,sha256=mv6iGcN1peSY7z2dtCQ_BKj31EFBNfczBhps_d-0XDo,6377
8
8
  pembot/query.py,sha256=D1RPRoImDWCafbshT2NpO4ymVj2RySm8j5FJ5bRzYWw,8476
9
- pembot/.git/COMMIT_EDITMSG,sha256=Tm5a80gXZb68HaSqG7HGQWXscLh79ki2ekRF5swa3lw,238
9
+ pembot/requirements.txt,sha256=6OV_n5JVco2lLA8Wq38tJX1bYgo_UU0R9RKgs4d2wfc,1360
10
+ pembot/.git/COMMIT_EDITMSG,sha256=HR106qWTNcQKmC8LAIwmZ9A9YBTENaUYQy3UtJmK0XY,238
10
11
  pembot/.git/HEAD,sha256=KNJb-Cr0wOK3L1CVmyvrhZ4-YLljCl6MYD2tTdsrboA,21
11
12
  pembot/.git/config,sha256=ZFl9d2GyxirgRXRsv8iULIieKxwGC9P6SAjB_AmTkmQ,271
12
13
  pembot/.git/description,sha256=hatsFj1DoX6pz3eIMIvKFGbxsKjRzJLibpv2PaQGKu4,73
13
- pembot/.git/index,sha256=08t-REdJo1OXpli82XwFCMECzmOhGxRuHdZgRyD1wnk,1814
14
+ pembot/.git/index,sha256=9R33jd4OjVHXzQElukY8zpNSB7v6vW4j4GMcpNBT5bo,1814
14
15
  pembot/.git/packed-refs,sha256=7DECsr7q7vJ6Gw6a2gS3dE4v-YzbxGiWYoSWM43DgsQ,112
15
16
  pembot/.git/hooks/applypatch-msg.sample,sha256=AiNJeguLAzqlijpSG4YphpOGz3qw4vEBlj0yiqYhk_c,478
16
17
  pembot/.git/hooks/commit-msg.sample,sha256=H3TV6SkpebVz69WXQdRsuT_zkazdCD00C5Q3B1PZJDc,896
@@ -27,10 +28,10 @@ pembot/.git/hooks/push-to-checkout.sample,sha256=pT0HQXmLKHxt16-mSu5HPzBeZdP0lGO
27
28
  pembot/.git/hooks/sendemail-validate.sample,sha256=ROv8kj3FRmvACWAvDs8Ge5xlRZq_6IaN3Em3jmztepI,2308
28
29
  pembot/.git/hooks/update.sample,sha256=jV8vqD4QPPCLV-qmdSHfkZT0XL28s32lKtWGCXoU0QY,3650
29
30
  pembot/.git/info/exclude,sha256=ZnH-g7egfIky7okWTR8nk7IxgFjri5jcXAbuClo7DsE,240
30
- pembot/.git/logs/HEAD,sha256=sAWs-cIFQY0k-nZzMnbgHnfYWnvRiZA4PGHifOTzx-c,1929
31
- pembot/.git/logs/refs/heads/main,sha256=sAWs-cIFQY0k-nZzMnbgHnfYWnvRiZA4PGHifOTzx-c,1929
31
+ pembot/.git/logs/HEAD,sha256=crGP01FLAqdksSr1razn4_Aa5devc2MaSbfStzWV4Os,2160
32
+ pembot/.git/logs/refs/heads/main,sha256=crGP01FLAqdksSr1razn4_Aa5devc2MaSbfStzWV4Os,2160
32
33
  pembot/.git/logs/refs/remotes/origin/HEAD,sha256=OrkNquczPPh6fEGtutFKva_-_JhAdwnvXpCCPC4N6jk,194
33
- pembot/.git/logs/refs/remotes/origin/main,sha256=TzZ8B_7cB9ODuU2sqJ_2omvV7houIBI8mq-7SrWFyNY,876
34
+ pembot/.git/logs/refs/remotes/origin/main,sha256=5cKDe0WKpvOSobN6UHTaj0is1mUWZMz0xjzyBSz1l2s,1022
34
35
  pembot/.git/objects/0a/fb3a98cdc55b1434b44534ec2bf22c56cfa26c,sha256=Xxw20vI57zuhERWopDAZpQw6rAOhFtUr05lzpGyCTTE,120
35
36
  pembot/.git/objects/0b/db4169fc0f312b8698f1df17a258fff163aeaa,sha256=hsOHhX0Yajg27Y7B9lo-WjDXzW1KNMg2CBr93G116EY,387
36
37
  pembot/.git/objects/0c/8d9b2690545bf1906b05cd9f18b783b3eb74f1,sha256=GKt_CAJNOQXwGnoFLuiNpkd0s_hP_UDLKd59VRknYy0,330
@@ -40,10 +41,12 @@ pembot/.git/objects/1f/83a471c8119f7794d98c049170a5d7d07a4b71,sha256=XnMaYQUA8iT
40
41
  pembot/.git/objects/28/db0ab48059acccd7d257aa02e52e9b6b83a4a5,sha256=S6PrWSQlkifYxKIgFdU0PZD0uLebS6uAP2LAUwp5yOI,91
41
42
  pembot/.git/objects/35/97e518a8658280be9f377f78edf1dfa1f23814,sha256=gfc5bFLVZpwNQb1Ox2VosDYAjw0Lc5ZLjmvNA8gWcmg,2546
42
43
  pembot/.git/objects/3d/07d3b29ff53d95de3898fb786d61732f210515,sha256=A9MNZO3QZ6ghGd1MyfmJ6H3dBTpF4HZcRosVxWytx8E,4077
44
+ pembot/.git/objects/3e/23850624fcf5f111d6ea88ddd64adf924cf82f,sha256=ygVUpaLo7cxUdIgjFlaBh2BkllV6BIYYkzLIxsPKjWE,4111
43
45
  pembot/.git/objects/3e/cf23eb95123287531d708a21d4ba88d92ccabb,sha256=Jlg3XIzIjk3N5ZKolXbz_betMybJ2t2TVuOARg2ruQU,4943
44
46
  pembot/.git/objects/3f/78215d7e17da726fb352fd92b3c117db9b63ba,sha256=J8r5hqTEgAwlH5sDjr9tp1ipqpvs4BAVQY5rkiKqDCw,4080
45
47
  pembot/.git/objects/3f/e072cf3cb6a9f30c3e9936e3ddf622e80270d0,sha256=Z-UoKi2MYe0qGTtBxAr5cnIOHKkhoEXMgalevFUz9lA,2992
46
48
  pembot/.git/objects/41/cbeb6bcb4c6fa9ef9be571082d95ecb4ea0ee3,sha256=waMrzjG_o5D4JgHkjjqcDQCwuS17w60JRkVr25ZFlcI,117
49
+ pembot/.git/objects/4d/a03134f70896f72053fbdc0cd4f4c76d4ac1d8,sha256=GBhAvxM1omIt-PN6mNXYlIJMN5nx2AUE0ZOf68El5pc,117
47
50
  pembot/.git/objects/51/9e780574933d7627a083222bd10dd74f430904,sha256=3e3Iu2-waVySghbLYXmwhDPpfhV4PF82suvjcYkSVog,3604
48
51
  pembot/.git/objects/61/46a371b9c1bd9f51af273f11f986cfd1bedeba,sha256=KZvfnjxuriY54uWZQOM-GLovAvHs1k8_KwhpjNA5lW4,128
49
52
  pembot/.git/objects/63/1700a51c8fa97b543991f5f61bfcd1e7e1327d,sha256=sYkhBkrSPQ8klX2gPrXJUZVt2a0iaF7KC7NFGBuxgeY,4360
@@ -55,6 +58,7 @@ pembot/.git/objects/86/cdaec229f1fbebf43042266b03878944669f25,sha256=eTvQhUeYXP8
55
58
  pembot/.git/objects/87/d6df5217a4a374f8c1211a05f9bd657f72c9a7,sha256=OGq5-x1lFa94vTX7WYO6o4TGvCZwAvZ6LXm6N3dpiKM,3881
56
59
  pembot/.git/objects/8b/5be2af9b16f290549193859c214cd9072212e8,sha256=DhGeGisCdFZ0TcRKp5angRpaseI87TQDt5FtGZInstk,117
57
60
  pembot/.git/objects/93/8f29d9b4b1ae86e39dddf9e3d115a82ddfc9b6,sha256=xf8oZ5IBMTxfkH7MFfukV7ZIu0Apd-78eJTdlI7GBv0,90
61
+ pembot/.git/objects/95/28bbccd167e3f4ad583a1ae9fac98a52620e27,sha256=jwJdRviwjGJIyMpE_BM6mr7B9ofGEsI5ZToJo5nmlao,263
58
62
  pembot/.git/objects/9b/123713e30fc9e225f9ac8ff5b02f8f8cf86456,sha256=xIETiieOoilleucGg7vXOgjZ-v5PI0t34fDJjDD665A,4204
59
63
  pembot/.git/objects/ab/139d2cd4798dd8e2c565b80440b1a44b376126,sha256=v1UO-WINmigZNYD74kyIv310Kq5k4SNL-gQ2DYlw9xk,6258
60
64
  pembot/.git/objects/ab/c6b15265171457b41e2cfdaf3b8c3994a59eb7,sha256=ivRCkHzUZHXB16wn2ojARknUrwBkoUsV_18QT3Jbs-k,205
@@ -62,11 +66,14 @@ pembot/.git/objects/ac/9c9018c62fa30dc142665c1b5a375f4e056880,sha256=P_8LPBV0v4D
62
66
  pembot/.git/objects/b1/1173d9b68db117437ccb9551461152e1e8a77d,sha256=6cl8NMNQ9b5fBh97GPEQNssOVrh-EQLJfhqSBbNb_vU,205
63
67
  pembot/.git/objects/b2/4e79ab07fe9e68781961a25ff9f1dbb1546fbb,sha256=zfd9KnP9YtBMwzci1BMWFHAQR4BWJ3XQsyr-rFqdw0Q,135
64
68
  pembot/.git/objects/b8/eea52176ffa4d88c5a9976bee26092421565d3,sha256=xCom1B6wyws8ZNTJoIL4JtVIXNv1yPCwsXfNsVCAGQA,4410
69
+ pembot/.git/objects/bd/8fd1cb166996e74a8631f3a6f764a53af75297,sha256=JOkICUEv6tdVp7mYDUKtXnsWq3IIZSmm8iUP7OqQwc4,56
65
70
  pembot/.git/objects/bf/068a0714e2145de83a5c004f4213b091439d0e,sha256=MpiiCqAk6GQ5iGzeThU0rsabrgA5tCAgdIWudAM0IrA,420
66
71
  pembot/.git/objects/bf/32a7e6872e5dc4025ee3df3c921ec7ade0855f,sha256=lwL9ickzIFtMJgNKaPp6nTGDlMhPs6fkZTWevQWK_Lc,56
72
+ pembot/.git/objects/bf/518686b06069d2a8abd3689908b7e1a6e16b05,sha256=w-HgdJdX2_ZdiIptJv8BcWdeDEyhl42WEk8P72X8YKU,421
67
73
  pembot/.git/objects/c0/793458db6e1bee7f79f1a504fb8ff4963f8ed3,sha256=b8lo_OrMeGgirc9yY_OFjv5xVpG6FBpZnBf7jbtlmyw,421
68
74
  pembot/.git/objects/c2/443060c07101948487cfa93cc39e082e9e0f5f,sha256=d9rjB8sgBOUQ-HQ8yu5I-c5Dqr_q2z0OOCXSufjDAak,3998
69
75
  pembot/.git/objects/d0/937f7d832266337289d5ec09459f931a46fcf7,sha256=_RZ7Z2EZp1OOF_XZhY6e1tzWwhI8Fa5R9aaF_W8APBA,56
76
+ pembot/.git/objects/e0/9162dbd64d85bb5ed740aa99faefa73f293d78,sha256=I5fpz3BQ2maFPTSu43T1uvYMuLiep1C3K6CsX8UMNPI,196
  pembot/.git/objects/e5/3070f2b07f45d031444b09b1b38658f3caf29e,sha256=irJ-z8kPZmg85B0f4TQz73yJoCMWMWsIR3Pi5wx1Dlk,4034
  pembot/.git/objects/e7/911a702079a6144997ea4e70f59abbe59ec2bc,sha256=r4zY-__F4gSfjE7onRTrcxvv8umXKuPuFzd95AiQ0cs,392
  pembot/.git/objects/e9/1172752e9a421ae463112d2b0506b37498c98d,sha256=qWZpM65kQPSxlVHAtyzH5L-j3rL-b9Jw-A7YBm4NMlI,249
@@ -83,26 +90,28 @@ pembot/.git/objects/fc/e56f1e09d09a05b9babf796fb40bece176f3a2,sha256=g-IVuI_8YBn
  pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.idx,sha256=CNzx_lz6v4PulPxRW2t9nz-ifvplpSFPhMA2M9WNUrA,3424
  pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.pack,sha256=dk3Sqrd0L-tNVLRy3uJdTYJNkw8v59mE1hV8zrCFNzc,41355
  pembot/.git/objects/pack/pack-d5469edc8c36e3bb1de5e0070e4d5b1eae935dd4.rev,sha256=7U3tpTWQ3dn5dwQo_KWMWxF31cKaDnCk2AzTO7Cx4Bg,388
- pembot/.git/refs/heads/main,sha256=54t3FdOxyxoG7KRRUgPS3G9f4WGNrpxuNVAPZbd3N0o,41
+ pembot/.git/refs/heads/main,sha256=Sz6HMFv8rlaBjeNHaBfSrRUorGipDPAJnfxmiUADG5I,41
  pembot/.git/refs/remotes/origin/HEAD,sha256=K7aiSqD8bEhBAPXVGim7rYQc0sdV9dk_qiBOXbtOsrQ,30
- pembot/.git/refs/remotes/origin/main,sha256=54t3FdOxyxoG7KRRUgPS3G9f4WGNrpxuNVAPZbd3N0o,41
+ pembot/.git/refs/remotes/origin/main,sha256=Sz6HMFv8rlaBjeNHaBfSrRUorGipDPAJnfxmiUADG5I,41
  pembot/AnyToText/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- pembot/AnyToText/convertor.py,sha256=26Pq4OLhVNHgIhJdLLcxGPFTtdnG2lsQkR_53_zkZZM,16997
+ pembot/AnyToText/convertor.py,sha256=8fDFxjyiL8H9mhZTjmxgePQj-sVZCHnEfMooYMqt6wk,17104
  pembot/TextEmbedder/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  pembot/TextEmbedder/gemini_embedder.py,sha256=P679-2mmQESlYKML1vcrwx_-CSgWJgIQk7NL4F7BLQE,677
  pembot/TextEmbedder/mongodb_embedder.py,sha256=pD8mP-uC_o0COPdOrCTMpoC5PdF8hXlqARHvTr2T-VI,9642
  pembot/TextEmbedder/mongodb_index_creator.py,sha256=ejpsF_y1zY6Z0nux02vjODiDPnxx-YA_xy2PmT94zZ4,5306
  pembot/TextEmbedder/vector_query.py,sha256=Kh1uhx9CatB-oQlQtnW-1I2Qz7MGHI20n2h_8peAChM,1986
- pembot/config/config.yaml,sha256=xANYakwM7ZTuPH89FNY-Z1V050Lvi_HtHnOUvIKjBqs,156
+ pembot/config/config.yaml,sha256=JHvRjzmkPIdKjryQY3W375B1IQgFvbumQ727AwvRW7U,156
  pembot/pdf2markdown/LICENSE,sha256=1JTJhQjUYDqJzFJhNtitm7mHyE71PRHgetIqRRWg6Pk,1068
  pembot/pdf2markdown/README.md,sha256=jitM1pwI69oa0N4mXv5-SY1ka9Sz3jsRNCDdpW-50kY,4545
  pembot/pdf2markdown/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- pembot/pdf2markdown/extract.py,sha256=PreFHesr8T6pVzgX87nSbPBlSsKfQZKOf5CVya2ys0s,35212
+ pembot/pdf2markdown/extract.py,sha256=ylkPfxMJiePUKmGlZ8B3fz51FtG17Q8P27KGLrz9J48,33289
+ pembot/pdf2markdown/pyrightconfig.json,sha256=Vt_k4N2LtZhth0lQOQAOnRKDOQkYYVzmdtb-bP3gu7M,47
  pembot/pdf2markdown/requirements.txt,sha256=0vZQzkSZKLNVUttd4euoDyYEy0nc2W3CIVxhepHW5Ho,76
+ pembot/pdf2markdown/.git/COMMIT_EDITMSG,sha256=n3-nJDAjMCjnbADDTrmOPQYgrUSZRElYCQsXxv_AS1g,64
  pembot/pdf2markdown/.git/HEAD,sha256=KNJb-Cr0wOK3L1CVmyvrhZ4-YLljCl6MYD2tTdsrboA,21
- pembot/pdf2markdown/.git/config,sha256=ltEWI476vFz2goGWD7QmCDvC6UCQ9ELviXuURlvte_w,269
+ pembot/pdf2markdown/.git/config,sha256=bxpN4Vp2IKsAw9QkRoCIXULseCngmK7OQMg_81HDmww,398
  pembot/pdf2markdown/.git/description,sha256=hatsFj1DoX6pz3eIMIvKFGbxsKjRzJLibpv2PaQGKu4,73
- pembot/pdf2markdown/.git/index,sha256=nB0bXnuMd6Gde6mwU2J3v3xf7FR-BAxLUrije4qQ4IY,488
+ pembot/pdf2markdown/.git/index,sha256=b6ZJzQ8qBONIIcgKXWOnK38rPQ1h93xQGpjMUmeVhqc,656
  pembot/pdf2markdown/.git/packed-refs,sha256=kJfKR7KBh8Ao4cGF_14wFxiFMP_lBLTKdXRAB2UMQ_o,112
  pembot/pdf2markdown/.git/hooks/applypatch-msg.sample,sha256=AiNJeguLAzqlijpSG4YphpOGz3qw4vEBlj0yiqYhk_c,478
  pembot/pdf2markdown/.git/hooks/commit-msg.sample,sha256=H3TV6SkpebVz69WXQdRsuT_zkazdCD00C5Q3B1PZJDc,896
@@ -119,19 +128,32 @@ pembot/pdf2markdown/.git/hooks/push-to-checkout.sample,sha256=pT0HQXmLKHxt16-mSu
  pembot/pdf2markdown/.git/hooks/sendemail-validate.sample,sha256=ROv8kj3FRmvACWAvDs8Ge5xlRZq_6IaN3Em3jmztepI,2308
  pembot/pdf2markdown/.git/hooks/update.sample,sha256=jV8vqD4QPPCLV-qmdSHfkZT0XL28s32lKtWGCXoU0QY,3650
  pembot/pdf2markdown/.git/info/exclude,sha256=ZnH-g7egfIky7okWTR8nk7IxgFjri5jcXAbuClo7DsE,240
- pembot/pdf2markdown/.git/logs/HEAD,sha256=jJscThcgJ-i1V19vA4RVs9acp0QIKsVSwY9zAmV3tjU,193
- pembot/pdf2markdown/.git/logs/refs/heads/main,sha256=jJscThcgJ-i1V19vA4RVs9acp0QIKsVSwY9zAmV3tjU,193
+ pembot/pdf2markdown/.git/logs/HEAD,sha256=kgz5CoaL_AuYbbsv4KXiCvuqydnLusQUvmjDdzMtl6U,1002
+ pembot/pdf2markdown/.git/logs/refs/heads/main,sha256=kgz5CoaL_AuYbbsv4KXiCvuqydnLusQUvmjDdzMtl6U,1002
+ pembot/pdf2markdown/.git/logs/refs/remotes/myorigin/main,sha256=bmsvulVWtYHEVD_JpjpPFXZKCZd9dZVkA-XT3fzBauw,438
  pembot/pdf2markdown/.git/logs/refs/remotes/origin/HEAD,sha256=jJscThcgJ-i1V19vA4RVs9acp0QIKsVSwY9zAmV3tjU,193
+ pembot/pdf2markdown/.git/objects/14/251b198e0bac39a3dc3b42f9e57b20c01465fb,sha256=Ssx4RupGzteVz0Irtgh95-Ccnacskv8ql8zLtqUgmOE,209
+ pembot/pdf2markdown/.git/objects/24/8f03b5f969a7fbd396b496f40b57f0ae81c148,sha256=ScB91DWSzfIrFLnghWglGqxxxmHxzODACQiXJEHDeWA,229
+ pembot/pdf2markdown/.git/objects/57/74dc9c3901d2ffb2cd7dafe2ad6612a7f9f42c,sha256=0Vkgzw7kU0cludbgJUyqCWLgK5Q3vfFnoKmeLq6c-uU,52
+ pembot/pdf2markdown/.git/objects/72/2dc14f82e78ce41717348b256e0c17834933b4,sha256=062pZN8JWfsC9z4MKIEgUcLIdnjzC6hwPjjsvHDhW-M,266
+ pembot/pdf2markdown/.git/objects/79/eb7b93ced70e399bd561093c45de7641414dbd,sha256=4mcMnseFu9SBgw2L5xJe3V_Lb5ZcjBRv1Dc-pAZrznw,9793
+ pembot/pdf2markdown/.git/objects/8d/9ce1fd9733a78c592b34af9c94b98960c601ed,sha256=eJMRf2BFDCxSgPuVPPLd6zZu3NmwMeYVYwyxW9QkW6M,9772
+ pembot/pdf2markdown/.git/objects/95/745843bb4377d6042180daeda818c0b16fd493,sha256=ddMj81nqLqqtVtrJ6TV7eOEjrzq38AbIjgWAPj0MaT8,12391
+ pembot/pdf2markdown/.git/objects/a5/c6dfb577782c259990dcf977e355298e923428,sha256=c6vkmaxLJ8-6V2DykAhGnGUFJc1EH_-TuDeijrrHRWg,265
+ pembot/pdf2markdown/.git/objects/b4/8d697aa9fd97151eb2a84a1af5d408b7630232,sha256=nSKTkx4mVrz7uaJkacuDJH7KO-vR1-OrvBV-e2HQvm0,194
+ pembot/pdf2markdown/.git/objects/b8/702320e56074e9680181d8b7897d6a0a552e2d,sha256=-XJJ4C0svu4LaZ9Zi3pAWVvy18w2CJ2lg16Zr2Hnu-U,372
+ pembot/pdf2markdown/.git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391,sha256=FkxfqAZ_rPGkPwnOPQ416_U6f1cj7L8VqGZ8_FPCb2w,15
  pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.idx,sha256=nZ0BJQYRC49OtqnyhZR_teR85PqslUG6j16UAKoX8m4,3452
  pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.pack,sha256=_KzHMGgrVzHGn2ZiKyHlvqc-BwTEeq3PqDPPJ9DYI5E,32222
  pembot/pdf2markdown/.git/objects/pack/pack-d3051affdd6c31306dc53489168fc870872085d1.rev,sha256=1jASJFjt2r2Sxd2G87oSTfrQnowK2ThvjVlWTIF-47E,392
- pembot/pdf2markdown/.git/refs/heads/main,sha256=E8R7BznAiEnrV1-8RJMSrGaOr2vzg00LO6dN7x8O37o,41
+ pembot/pdf2markdown/.git/refs/heads/main,sha256=oRkN5qBSGT5N23aQ_E4DIUGMZLPez-Cij_1QgK-k3jI,41
+ pembot/pdf2markdown/.git/refs/remotes/myorigin/main,sha256=oRkN5qBSGT5N23aQ_E4DIUGMZLPez-Cij_1QgK-k3jI,41
  pembot/pdf2markdown/.git/refs/remotes/origin/HEAD,sha256=K7aiSqD8bEhBAPXVGim7rYQc0sdV9dk_qiBOXbtOsrQ,30
  pembot/pdf2markdown/config/config.yaml,sha256=w75W2Eg4-tu8rRk_23PqxWDh0010kRKLmPrh46f_Njc,66
  pembot/utils/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
  pembot/utils/inference_client.py,sha256=jeURmY2P5heVlH1dCV0XSgiX3U2qYGEmrnUv0KFpdww,5380
  pembot/utils/string_tools.py,sha256=gtRa5rBR0Q7GspTu2WtCnvhJQLFjPfWLvhmyiPkyStU,1883
- pembot-0.0.5.dist-info/licenses/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
- pembot-0.0.5.dist-info/WHEEL,sha256=Dyt6SBfaasWElUrURkknVFAZDHSTwxg3PaTza7RSbkY,100
- pembot-0.0.5.dist-info/METADATA,sha256=RUAzpxKZigCjcI-yk-WR1zD_u15rKhQqWAqAIhxnnNs,313
- pembot-0.0.5.dist-info/RECORD,,
+ pembot-0.0.6.dist-info/licenses/LICENSE,sha256=OXLcl0T2SZ8Pmy2_dmlvKuetivmyPd5m1q-Gyd-zaYY,35149
+ pembot-0.0.6.dist-info/WHEEL,sha256=Dyt6SBfaasWElUrURkknVFAZDHSTwxg3PaTza7RSbkY,100
+ pembot-0.0.6.dist-info/METADATA,sha256=jcibBPdDsmAbgWICvgecVgEEk_9wPQ4xDBkHpdhjKPw,313
+ pembot-0.0.6.dist-info/RECORD,,
File without changes