nv-ingest-api 2025.5.18.dev20250518__py3-none-any.whl → 2025.5.19.dev20250519__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
This version of nv-ingest-api has been flagged as a potentially problematic release; details are available on the registry page.
- nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/docxreader.py +142 -86
- nv_ingest_api/internal/extract/pptx/engines/pptx_helper.py +170 -171
- nv_ingest_api/internal/transform/split_text.py +9 -3
- {nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/METADATA +1 -1
- {nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/RECORD +8 -8
- {nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/WHEEL +0 -0
- {nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/licenses/LICENSE +0 -0
- {nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/top_level.txt +0 -0
nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/docxreader.py

```diff
@@ -274,59 +274,70 @@ class DocxReader:
             A list of extracted images from the paragraph.
         """
 
-        … (5 removed lines not shown in the source diff)
+        try:
+            paragraph_images = []
+            if self.paragraph_format == "text":
+                return paragraph.text.strip(), paragraph_images
+
             font = paragraph.style.font
             default_style = (font.bold, font.italic, font.underline)
 
-        # Iterate over the runs of the paragraph and group them by style, excluding empty runs
             paragraph_text = ""
             group_text = ""
             previous_style = None
 
             for c in paragraph.iter_inner_content():
-                … (25 removed lines not shown in the source diff)
+                try:
+                    if isinstance(c, Hyperlink):
+                        text = f"[{c.text}]({c.address})"
+                        style = (c.runs[0].bold, c.runs[0].italic, c.runs[0].underline)
+                    elif isinstance(c, Run):
+                        text = c.text
+                        style = (c.bold, c.italic, c.underline)
+
+                        # 1. Locate the inline shape which is stored in the <w:drawing> element.
+                        # 2. r:embed in <a.blip> has the relationship id for extracting the file where
+                        #    the image is stored as bytes.
+                        # Reference:
+                        # https://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
+                        inline_shapes = c._element.xpath(".//w:drawing//a:blip/@r:embed")
+                        for r_id in inline_shapes:
+                            text += self.image_tag.format(self.image_tag_index)
+                            self.image_tag_index += 1
+                            try:
+                                image = paragraph.part.related_parts[r_id].image
+                                paragraph_images.append(image)
+                            except Exception as img_e:
+                                logger.warning(
+                                    "Failed to extract image with rId " "%s: %s -- object / file may be malformed",
+                                    r_id,
+                                    img_e,
+                                )
+                    else:
+                        continue
+
+                    style = tuple(s if s is not None else d for s, d in zip(style, default_style))
+
+                    if not self.is_text_empty(text) and previous_style is not None and style != previous_style:
                         paragraph_text += self.format_text(group_text, *previous_style)
                         group_text = ""
 
-                … (3 removed lines not shown)
+                    group_text += text
+                    if not self.is_text_empty(text):
+                        previous_style = style
 
-                … (3 removed lines not shown)
+                except Exception as e:
+                    logger.error("format_paragraph: failed to process run: %s", e)
+                    continue
+
+            if group_text and previous_style:
+                paragraph_text += self.format_text(group_text, *previous_style)
+
+            return paragraph_text.strip(), paragraph_images
 
-            … (3 removed lines not shown)
+        except Exception as e:
+            logger.error("format_paragraph: failed for paragraph: %s", e)
+            return "", []
 
     def format_cell(self, cell: "_Cell") -> Tuple[str, List["Image"]]:
         """
```
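The reworked run handling above resolves each `r:embed` relationship id to an image part via `related_parts`. A minimal standalone sketch of that same pattern, assuming a local `sample.docx` (the filename is illustrative):

```python
from docx import Document

doc = Document("sample.docx")
images = []
for paragraph in doc.paragraphs:
    for run in paragraph.runs:
        # r:embed on <a:blip> carries the relationship id of the embedded image part.
        for r_id in run._element.xpath(".//w:drawing//a:blip/@r:embed"):
            image_part = paragraph.part.related_parts[r_id]  # rId -> ImagePart
            images.append(image_part.image)                  # lazily loaded Image object

print(f"extracted {len(images)} inline image(s)")
```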
```diff
@@ -344,12 +355,23 @@ class DocxReader:
             A list of images extracted from the cell.
         """
 
-
-        newline = "<br>"
-        … (4 removed lines not shown)
+        try:
+            newline = "<br>" if self.paragraph_format == "markdown" else "\n"
+            texts, images = [], []
+
+            for p in cell.paragraphs:
+                try:
+                    t, imgs = self.format_paragraph(p)
+                    texts.append(t)
+                    images.extend(imgs)
+                except Exception as e:
+                    logger.error("format_cell: failed to format paragraph in cell: %s", e)
+
+            return newline.join(texts), images
+
+        except Exception as e:
+            logger.error("format_cell: failed entirely: %s", e)
+            return "", []
 
     def format_table(self, table: "Table") -> Tuple[Optional[str], List["Image"], DataFrame]:
         """
```
```diff
@@ -368,25 +390,50 @@ class DocxReader:
             A DataFrame representation of the table's content.
         """
 
-        … (17 removed lines not shown)
+        try:
+            rows_data = []
+            all_images = []
+
+            for row in table.rows:
+                row_texts = []
+                row_images = []
+                for cell in row.cells:
+                    try:
+                        cell_text, cell_imgs = self.format_cell(cell)
+                        row_texts.append(cell_text)
+                        row_images.extend(cell_imgs)
+                    except Exception as e:
+                        logger.error("format_table: failed to process cell: %s", e)
+                        row_texts.append("")  # pad for column alignment
+
+                rows_data.append(row_texts)
+                all_images.extend(row_images)
+
+            if not rows_data or not rows_data[0]:
+                return None, [], pd.DataFrame()
+
+            header = rows_data[0]
+            body = rows_data[1:]
+            df = pd.DataFrame(body, columns=header) if body else pd.DataFrame(columns=header)
+
+            if "markdown" in self.table_format:
+                table_text = df.to_markdown(index=False)
+                if self.table_format == "markdown_light":
+                    table_text = re.sub(r"\s{2,}", " ", table_text)
+                    table_text = re.sub(r"-{2,}", "-", table_text)
+            elif self.table_format == "csv":
+                table_text = df.to_csv(index=False)
+            elif self.table_format == "tag":
+                table_text = self.table_tag.format(self.table_tag_index)
+                self.table_tag_index += 1
+            else:
+                raise ValueError(f"Unknown table format {self.table_format}")
+
+            return table_text, all_images, df
 
-        … (1 removed line not shown)
+        except Exception as e:
+            logger.error("format_table: failed to format table: %s", e)
+            return None, [], pd.DataFrame()
 
     @staticmethod
     def apply_text_style(style: str, text: str, level: int = 0) -> str:
```
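The `markdown_light` branch above trades alignment for compactness by collapsing the padding that `DataFrame.to_markdown` emits. A toy illustration of that post-processing (the frame contents are made up; `to_markdown` requires the `tabulate` package):

```python
import re
import pandas as pd

df = pd.DataFrame([["a", "first row"], ["b", "second"]], columns=["key", "value"])
table_text = df.to_markdown(index=False)    # padded, aligned markdown table
light = re.sub(r"\s{2,}", " ", table_text)  # collapse alignment padding
light = re.sub(r"-{2,}", "-", light)        # shrink the |---|---| header rule
print(light)
```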
```diff
@@ -841,30 +888,39 @@ class DocxReader:
         self._prev_para_image_idx = 0
 
         para_idx = 0
-
         for child in self.document.element.body.iterchildren():
-            … (18 removed lines not shown)
+            try:
+                if isinstance(child, CT_P):
+                    paragraph = Paragraph(child, self.document)
+                    paragraph_text, paragraph_images = self.format_paragraph(paragraph)
+
+                    if extract_text:
+                        try:
+                            self._extract_para_text(
+                                paragraph,
+                                paragraph_text,
+                                base_unified_metadata,
+                                text_depth,
+                                para_idx,
+                            )
+                        except Exception as e:
+                            logger.error("extract_data: _extract_para_text failed: %s", e)
+
+                    if (extract_images or extract_charts or extract_tables) and paragraph_images:
+                        self._pending_images += [
+                            (image, para_idx, "", base_unified_metadata) for image in paragraph_images
+                        ]
+                        self.images.extend(paragraph_images)
+
+                elif isinstance(child, CT_Tbl):
+                    if extract_tables or extract_charts:
+                        try:
+                            self._extract_table_data(child, base_unified_metadata)
+                        except Exception as e:
+                            logger.error("extract_data: _extract_table_data failed: %s", e)
 
-            … (2 removed lines not shown)
-            self._extract_table_data(child, base_unified_metadata)
+            except Exception as e:
+                logger.error("extract_data: failed to process element at index %d: %s", para_idx, e)
 
             para_idx += 1
```
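The traversal above walks `document.element.body` so paragraphs and tables come back interleaved in document order, which `Document.paragraphs` alone cannot provide. A minimal sketch of the same python-docx pattern, assuming a local `sample.docx`:

```python
from docx import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table
from docx.text.paragraph import Paragraph

doc = Document("sample.docx")
for child in doc.element.body.iterchildren():
    if isinstance(child, CT_P):          # a <w:p> paragraph element
        print("paragraph:", Paragraph(child, doc).text[:60])
    elif isinstance(child, CT_Tbl):      # a <w:tbl> table element
        table = Table(child, doc)
        print("table:", len(table.rows), "rows")
```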
nv_ingest_api/internal/extract/pptx/engines/pptx_helper.py

```diff
@@ -27,9 +27,9 @@ from typing import Optional
 import pandas as pd
 from pptx import Presentation
 from pptx.enum.dml import MSO_COLOR_TYPE
-from pptx.enum.dml import MSO_THEME_COLOR
+from pptx.enum.dml import MSO_THEME_COLOR  # noqa
 from pptx.enum.shapes import MSO_SHAPE_TYPE
-from pptx.enum.shapes import PP_PLACEHOLDER
+from pptx.enum.shapes import PP_PLACEHOLDER  # noqa
 from pptx.shapes.autoshape import Shape
 from pptx.slide import Slide
```
```diff
@@ -220,20 +220,13 @@ def python_pptx(
     extraction_config: dict,
     execution_trace_log: Optional[List] = None,
 ):
-    … (2 removed lines not shown)
-    classification into tables/charts if requested.
-    """
-
-    _ = extract_infographics  # Placeholder for future use
-    _ = execution_trace_log  # Placeholder for future use
+    _ = extract_infographics
+    _ = execution_trace_log
 
     row_data = extraction_config.get("row_data")
     source_id = row_data["source_id"]
 
-    text_depth = extraction_config.get("text_depth", "page")
-    text_depth = TextTypeEnum[text_depth.upper()]
-
+    text_depth = TextTypeEnum[extraction_config.get("text_depth", "page").upper()]
     paragraph_format = extraction_config.get("paragraph_format", "markdown")
     identify_nearby_objects = extraction_config.get("identify_nearby_objects", True)
```
```diff
@@ -241,16 +234,19 @@ def python_pptx(
     pptx_extractor_config = extraction_config.get("pptx_extraction_config", {})
     trace_info = extraction_config.get("trace_info", {})
 
-    base_unified_metadata = row_data
+    base_unified_metadata = row_data.get(metadata_col, {})
     base_source_metadata = base_unified_metadata.get("source_metadata", {})
     source_location = base_source_metadata.get("source_location", "")
     collection_id = base_source_metadata.get("collection_id", "")
     partition_id = base_source_metadata.get("partition_id", -1)
     access_level = base_source_metadata.get("access_level", AccessLevelEnum.UNKNOWN)
 
-    … (1 removed line not shown)
+    try:
+        presentation = Presentation(pptx_stream)
+    except Exception as e:
+        logger.error("Failed to open PPTX presentation: %s", e)
+        return []
 
-    # Collect source metadata from the core properties of the document.
     last_modified = (
         presentation.core_properties.modified.isoformat()
         if presentation.core_properties.modified
```
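`Presentation()` accepts a path or a binary stream, and it is the first place a truncated or corrupt upload raises, which is why the hunk above guards the call. A small sketch, assuming a local `deck.pptx`:

```python
from io import BytesIO
from pptx import Presentation

with open("deck.pptx", "rb") as f:
    pptx_stream = BytesIO(f.read())  # mirrors the in-memory stream the extractor receives

try:
    presentation = Presentation(pptx_stream)
    print("slides:", len(presentation.slides))
except Exception as e:
    print("unreadable pptx:", e)
```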
```diff
@@ -262,12 +258,11 @@ def python_pptx(
         else datetime.now().isoformat()
     )
     keywords = presentation.core_properties.keywords
-    source_type = DocumentTypeEnum.PPTX
     source_metadata = {
-        "source_name": source_id,
+        "source_name": source_id,
         "source_id": source_id,
         "source_location": source_location,
-        "source_type": …
+        "source_type": DocumentTypeEnum.PPTX,
         "collection_id": collection_id,
         "date_created": date_created,
         "last_modified": last_modified,
```
```diff
@@ -277,18 +272,16 @@ def python_pptx(
     }
 
     slide_count = len(presentation.slides)
-
     accumulated_text = []
     extracted_data = []
-
-    # Hold images here for final classification.
-    # Each item is (shape, shape_idx, slide_idx, slide_count, page_nearby_blocks, source_metadata,
-    # base_unified_metadata)
     pending_images = []
 
     for slide_idx, slide in enumerate(presentation.slides):
-        … (2 removed lines not shown)
+        try:
+            shapes = sorted(ungroup_shapes(slide.shapes), key=_safe_position)
+        except Exception as e:
+            logger.error("Slide %d: Failed to ungroup or sort shapes: %s", slide_idx, e)
+            continue
 
         page_nearby_blocks = {
             "text": {"content": [], "bbox": []},
```
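`ungroup_shapes` and `_safe_position` are nv-ingest helpers whose bodies this diff does not show. The stand-ins below are hypothetical reconstructions of what the call site implies: flatten group shapes recursively, then sort top-to-bottom, left-to-right while tolerating shapes whose offsets are `None`:

```python
from pptx.enum.shapes import MSO_SHAPE_TYPE

def ungroup_shapes(shapes):
    """Recursively flatten group shapes into a single flat list."""
    flat = []
    for shape in shapes:
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            flat.extend(ungroup_shapes(shape.shapes))  # GroupShape exposes .shapes
        else:
            flat.append(shape)
    return flat

def _safe_position(shape):
    """Sort key for reading order; missing offsets sort as 0 so sorted() never compares None."""
    top = shape.top if shape.top is not None else 0
    left = shape.left if shape.left is not None else 0
    return (top, left)
```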
```diff
@@ -297,152 +290,179 @@ def python_pptx(
     }
 
     for shape_idx, shape in enumerate(shapes):
-        … (3 removed lines not shown)
-        # ---------------------------------------------
-        # 1) Text Extraction
-        # ---------------------------------------------
-        if extract_text and shape.has_text_frame:
-            for paragraph_idx, paragraph in enumerate(shape.text_frame.paragraphs):
-                if not paragraph.text.strip():
-                    continue
-
-                for run_idx, run in enumerate(paragraph.runs):
-                    text = run.text
-                    if not text:
-                        continue
+        try:
+            block_text = []
+            added_title = added_subtitle = False
 
-
+            # Text extraction
+            if extract_text and shape.has_text_frame:
+                for paragraph_idx, paragraph in enumerate(shape.text_frame.paragraphs):
+                    if not paragraph.text.strip():
+                        continue
 
-        … (4 removed lines not shown)
-                        added_title = True
-                    else:
+                    for run_idx, run in enumerate(paragraph.runs):
+                        try:
+                            text = run.text
+                            if not text:
                                 continue
-        … (27 removed lines not shown)
+
+                            text = escape_text(text)
+
+                            if paragraph_format == "markdown":
+                                if is_title(shape) and not added_title:
+                                    text = process_title(shape)
+                                    added_title = True
+                                elif is_subtitle(shape) and not added_subtitle:
+                                    text = process_subtitle(shape)
+                                    added_subtitle = True
+                                elif is_title(shape) or is_subtitle(shape):
+                                    continue  # already added
+
+                                if run.hyperlink and run.hyperlink.address:
+                                    text = get_hyperlink(text, run.hyperlink.address)
+                                if is_accent(paragraph.font) or is_accent(run.font):
+                                    text = format_text(text, italic=True)
+                                elif is_strong(paragraph.font) or is_strong(run.font):
+                                    text = format_text(text, bold=True)
+                                elif is_underlined(paragraph.font) or is_underlined(run.font):
+                                    text = format_text(text, underline=True)
+                                if is_list_block(shape):
+                                    text = " " * paragraph.level + "* " + text
+
+                            accumulated_text.append(text)
+                            if extract_images and identify_nearby_objects:
+                                block_text.append(text)
+
+                            if text_depth == TextTypeEnum.SPAN:
+                                extracted_data.append(
+                                    _construct_text_metadata(
+                                        presentation,
+                                        shape,
+                                        accumulated_text,
+                                        keywords,
+                                        slide_idx,
+                                        shape_idx,
+                                        paragraph_idx,
+                                        run_idx,
+                                        slide_count,
+                                        text_depth,
+                                        source_metadata,
+                                        base_unified_metadata,
+                                    )
+                                )
+                                accumulated_text = []
+
+                        except Exception as e:
+                            logger.warning(
+                                "Slide %d Shape %d Run %d: Failed to process run: %s",
+                                slide_idx,
+                                shape_idx,
+                                run_idx,
+                                e,
+                            )
+
+                    if accumulated_text and not accumulated_text[-1].endswith("\n\n"):
+                        accumulated_text.append("\n\n")
+
+                    if text_depth == TextTypeEnum.LINE:
+                        extracted_data.append(
+                            _construct_text_metadata(
+                                presentation,
+                                shape,
+                                accumulated_text,
+                                keywords,
+                                slide_idx,
+                                shape_idx,
+                                paragraph_idx,
+                                -1,
+                                slide_count,
+                                text_depth,
+                                source_metadata,
+                                base_unified_metadata,
+                            )
+                        )
+                        accumulated_text = []
+
+                if text_depth == TextTypeEnum.BLOCK:
+                    extracted_data.append(
+                        _construct_text_metadata(
                             presentation,
                             shape,
                             accumulated_text,
                             keywords,
                             slide_idx,
                             shape_idx,
-            … (2 removed lines not shown)
+                            -1,
+                            -1,
                             slide_count,
                             text_depth,
                             source_metadata,
                             base_unified_metadata,
                         )
-            … (2 removed lines not shown)
-            accumulated_text = []
+                    )
+                    accumulated_text = []
 
-        … (3 removed lines not shown)
+            if extract_images and identify_nearby_objects and block_text:
+                page_nearby_blocks["text"]["content"].append("".join(block_text))
+                page_nearby_blocks["text"]["bbox"].append(get_bbox(shape_object=shape))
 
-        … (3 removed lines not shown)
+            # Image processing (deferred)
+            if extract_images:
+                try:
+                    process_shape(
                         shape,
-                accumulated_text,
-                keywords,
-                slide_idx,
                         shape_idx,
-
-                -1,
+                        slide_idx,
                         slide_count,
-            … (1 removed line not shown)
+                        pending_images,
+                        page_nearby_blocks,
                         source_metadata,
                         base_unified_metadata,
                     )
-        … (3 removed lines not shown)
+                except Exception as e:
+                    logger.warning("Slide %d Shape %d: Failed to process image shape: %s", slide_idx, shape_idx, e)
+
+            # Table extraction
+            if extract_tables and shape.has_table:
+                try:
+                    extracted_data.append(
+                        _construct_table_metadata(
+                            shape, slide_idx, slide_count, source_metadata, base_unified_metadata
+                        )
+                    )
+                except Exception as e:
+                    logger.warning("Slide %d Shape %d: Failed to extract table: %s", slide_idx, shape_idx, e)
 
-        … (10 removed lines not shown)
-                slide_count,
-                text_depth,
-                source_metadata,
-                base_unified_metadata,
-            )
-            if len(text_extraction) > 0:
-                extracted_data.append(text_extraction)
-            accumulated_text = []
-
-            if extract_images and identify_nearby_objects and block_text:
-                page_nearby_blocks["text"]["content"].append("".join(block_text))
-                page_nearby_blocks["text"]["bbox"].append(get_bbox(shape_object=shape))
-
-        # ---------------------------------------------
-        # 2) Image Handling (DEFERRED) with nested/group shapes
-        # ---------------------------------------------
-        if extract_images:
-            process_shape(
-                shape,
-                shape_idx,
+        except Exception as e:
+            logger.warning("Slide %d Shape %d: Top-level failure: %s", slide_idx, shape_idx, e)
+
+        if extract_text and text_depth == TextTypeEnum.PAGE and accumulated_text:
+            extracted_data.append(
+                _construct_text_metadata(
+                    presentation,
+                    None,
+                    accumulated_text,
+                    keywords,
                     slide_idx,
+                    -1,
+                    -1,
+                    -1,
                     slide_count,
-
-                page_nearby_blocks,
+                    text_depth,
                     source_metadata,
                     base_unified_metadata,
                 )
+            )
+            accumulated_text = []
 
-        … (3 removed lines not shown)
-        if extract_tables and shape.has_table:
-            table_extraction = _construct_table_metadata(
-                shape, slide_idx, slide_count, source_metadata, base_unified_metadata
-            )
-            extracted_data.append(table_extraction)
-
-        if extract_text and (text_depth == TextTypeEnum.PAGE) and (len(accumulated_text) > 0):
-            text_extraction = _construct_text_metadata(
+    if extract_text and text_depth == TextTypeEnum.DOCUMENT and accumulated_text:
+        extracted_data.append(
+            _construct_text_metadata(
                 presentation,
-                … (1 removed line not shown)
+                None,
                 accumulated_text,
                 keywords,
-                … (1 removed line not shown)
+                -1,
                 -1,
                 -1,
                 -1,
```
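`is_strong`, `is_underlined`, and `format_text` are likewise nv-ingest helpers that this diff only calls. The stand-ins below are hypothetical sketches of the python-pptx font attributes they plausibly read and the markdown emphasis the call sites imply:

```python
def is_strong(font) -> bool:
    # python-pptx fonts report bold as True/False/None (None = inherited).
    return bool(font.bold)

def is_underlined(font) -> bool:
    # underline may be True/False/None or an MSO_UNDERLINE enum member.
    return bool(font.underline)

def format_text(text: str, bold: bool = False, italic: bool = False, underline: bool = False) -> str:
    """Wrap a run's text in markdown-style emphasis markers."""
    if bold:
        text = f"**{text}**"
    if italic:
        text = f"*{text}*"
    if underline:
        text = f"<u>{text}</u>"
    return text
```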
```diff
@@ -451,41 +471,20 @@ def python_pptx(
                 source_metadata,
                 base_unified_metadata,
             )
-            if len(text_extraction) > 0:
-                extracted_data.append(text_extraction)
-            accumulated_text = []
-
-        if extract_text and (text_depth == TextTypeEnum.DOCUMENT) and (len(accumulated_text) > 0):
-            text_extraction = _construct_text_metadata(
-                presentation,
-                shape,  # may pass None
-                accumulated_text,
-                keywords,
-                -1,
-                -1,
-                -1,
-                -1,
-                slide_count,
-                text_depth,
-                source_metadata,
-                base_unified_metadata,
         )
-            if len(text_extraction) > 0:
-                extracted_data.append(text_extraction)
-            accumulated_text = []
 
-    # ---------------------------------------------
-    # FINAL STEP: Finalize images (and tables/charts)
-    # ---------------------------------------------
     if extract_images or extract_tables or extract_charts:
-        … (8 removed lines not shown)
+        try:
+            _finalize_images(
+                pending_images,
+                extracted_data,
+                pptx_extractor_config,
+                extract_tables=extract_tables,
+                extract_charts=extract_charts,
+                trace_info=trace_info,
+            )
+        except Exception as e:
+            logger.error("Finalization of images failed: %s", e)
 
     return extracted_data
```
nv_ingest_api/internal/transform/split_text.py

```diff
@@ -118,9 +118,15 @@ def transform_text_split_and_tokenize_internal(
     )
 
     # Filter to documents with text content.
-    … (2 removed lines not shown)
-    )
+    text_type_condition = df_transform_ledger["document_type"] == ContentTypeEnum.TEXT
+
+    normalized_meta_df = pd.json_normalize(df_transform_ledger["metadata"], errors="ignore")
+    if "source_metadata.source_type" in normalized_meta_df.columns:
+        source_type_condition = normalized_meta_df["source_metadata.source_type"].isin(split_source_types)
+    else:
+        source_type_condition = False
+
+    bool_index = text_type_condition & source_type_condition
     df_filtered: pd.DataFrame = df_transform_ledger.loc[bool_index]
 
     if df_filtered.empty:
```
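The new filter flattens each row's nested `metadata` dict with `pd.json_normalize`, so the dotted column `source_metadata.source_type` can be tested directly. A self-contained toy version (the ledger contents and `split_source_types` are illustrative; the real code compares `document_type` against `ContentTypeEnum.TEXT`):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "document_type": ["text", "image", "text"],
        "metadata": [
            {"source_metadata": {"source_type": "pdf"}},
            {"source_metadata": {"source_type": "pdf"}},
            {"source_metadata": {"source_type": "html"}},
        ],
    }
)
split_source_types = ["pdf"]

text_cond = df["document_type"] == "text"
meta = pd.json_normalize(df["metadata"].tolist())  # yields "source_metadata.source_type"
src_cond = meta["source_metadata.source_type"].isin(split_source_types)
print(df.loc[text_cond & src_cond])  # keeps only the first row
```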
{nv_ingest_api-2025.5.18.dev20250518.dist-info → nv_ingest_api-2025.5.19.dev20250519.dist-info}/RECORD

```diff
@@ -16,7 +16,7 @@ nv_ingest_api/internal/extract/docx/docx_extractor.py,sha256=jjbL12F5dtpbqHRbhL0…
 nv_ingest_api/internal/extract/docx/engines/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/__init__.py,sha256=uLsBITo_XfgbwpzqXUm1IYX6XlZrTfx6T1cIhdILwG8,140
 nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/docx_helper.py,sha256=1wkciAxu8lz9WuPuoleJFy2s09ieSzXl1S71F9r0BWA,4385
-nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/docxreader.py,sha256=…
+nv_ingest_api/internal/extract/docx/engines/docxreader_helpers/docxreader.py,sha256=FOZZBD9gRRAr93qgK_L6o9xVBYD-6EE5-xI2-cWKvzo,33713
 nv_ingest_api/internal/extract/image/__init__.py,sha256=wQSlVx3T14ZgQAt-EPzEczQusXVW0W8yynnUaFFGE3s,143
 nv_ingest_api/internal/extract/image/chart_extractor.py,sha256=CkaW8ihPmGMQGrZh0ih14gtEpWuGOJ8InPQfZwpsP2g,13300
 nv_ingest_api/internal/extract/image/image_extractor.py,sha256=4tUWinuFMN3ukWa2tZa2_LtzRiTyUAUCBF6BDkUEvm0,8705
@@ -37,7 +37,7 @@ nv_ingest_api/internal/extract/pdf/engines/pdf_helpers/__init__.py,sha256=Jk3wrQ…
 nv_ingest_api/internal/extract/pptx/__init__.py,sha256=HIHfzSig66GT0Uk8qsGBm_f13fKYcPtItBicRUWOOVA,183
 nv_ingest_api/internal/extract/pptx/pptx_extractor.py,sha256=o-0P2dDyRFW37uQi_lKk6-eFozTcZvbq-2Y4I0EBMIY,7749
 nv_ingest_api/internal/extract/pptx/engines/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
-nv_ingest_api/internal/extract/pptx/engines/pptx_helper.py,sha256=…
+nv_ingest_api/internal/extract/pptx/engines/pptx_helper.py,sha256=IZu0c_RHDSJwwclOZD3_tDu5jg4AEEfumbwKB78dUE0,29716
 nv_ingest_api/internal/mutate/__init__.py,sha256=wQSlVx3T14ZgQAt-EPzEczQusXVW0W8yynnUaFFGE3s,143
 nv_ingest_api/internal/mutate/deduplicate.py,sha256=hmvTTGevpCtlkM_wVZSoc8-Exr6rUJwqLjoEnbPcPzY,3849
 nv_ingest_api/internal/mutate/filter.py,sha256=H-hOTBVP-zLpvQr-FoGIJKxkhtj4l_sZ9V2Fgu3rTEM,5183
@@ -97,7 +97,7 @@ nv_ingest_api/internal/store/image_upload.py,sha256=GNlY4k3pfcHv3lzXxkbmGLeHFsf9…
 nv_ingest_api/internal/transform/__init__.py,sha256=wQSlVx3T14ZgQAt-EPzEczQusXVW0W8yynnUaFFGE3s,143
 nv_ingest_api/internal/transform/caption_image.py,sha256=RYL_b26zfaRlbHz0XvLw9HwaMlXpNhr7gayjxGzdALQ,8545
 nv_ingest_api/internal/transform/embed_text.py,sha256=F8kg-WXihtuUMwDQUUYjnfGDCdQp1Mkd-jeThOiJT0s,16507
-nv_ingest_api/internal/transform/split_text.py,sha256=…
+nv_ingest_api/internal/transform/split_text.py,sha256=DlVoyHLqZ-6_FiWwZmofPcq7TX8Ta23hIE0St9tw1IY,6822
 nv_ingest_api/util/__init__.py,sha256=wQSlVx3T14ZgQAt-EPzEczQusXVW0W8yynnUaFFGE3s,143
 nv_ingest_api/util/control_message/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 nv_ingest_api/util/control_message/validators.py,sha256=KvvbyheJ5rbzvJbH9JKpMR9VfoI0b0uM6eTAZte1p44,1315
@@ -147,8 +147,8 @@ nv_ingest_api/util/service_clients/rest/rest_client.py,sha256=dZ-jrk7IK7oNtHoXFS…
 nv_ingest_api/util/string_processing/__init__.py,sha256=mkwHthyS-IILcLcL1tJYeF6mpqX3pxEw5aUzDGjTSeU,1411
 nv_ingest_api/util/system/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
 nv_ingest_api/util/system/hardware_info.py,sha256=ORZeKpH9kSGU_vuPhyBwkIiMyCViKUX2CP__MCjrfbU,19463
-nv_ingest_api-2025.5.…
-nv_ingest_api-2025.5.…
-nv_ingest_api-2025.5.…
-nv_ingest_api-2025.5.…
-nv_ingest_api-2025.5.…
+nv_ingest_api-2025.5.19.dev20250519.dist-info/licenses/LICENSE,sha256=xx0jnfkXJvxRnG63LTGOxlggYnIysveWIZ6H3PNdCrQ,11357
+nv_ingest_api-2025.5.19.dev20250519.dist-info/METADATA,sha256=LF2uw9E7zhD2ylp4pRazX1C53VqDPN3FOO4NVrLXGe8,13889
+nv_ingest_api-2025.5.19.dev20250519.dist-info/WHEEL,sha256=Nw36Djuh_5VDukK0H78QzOX-_FQEo6V37m3nkm96gtU,91
+nv_ingest_api-2025.5.19.dev20250519.dist-info/top_level.txt,sha256=abjYMlTJGoG5tOdfIB-IWvLyKclw6HLaRSc8MxX4X6I,14
+nv_ingest_api-2025.5.19.dev20250519.dist-info/RECORD,,
```