# document-processing Specification Delta

## ADDED Requirements

### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.

#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells

#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries

### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.

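
As a non-normative illustration of this requirement, the path handling could be sketched as follows. The element is assumed to be a plain dictionary; the `type`, `width`, `height`, and `format` keys and both helper names are hypothetical, and only `saved_path` and `img_path` come from this specification.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Visual element types named in this specification.
VISUAL_TYPES = {"IMAGE", "FIGURE", "CHART", "DIAGRAM", "LOGO", "STAMP"}


def resolve_image_path(element: Dict[str, Any]) -> Optional[str]:
    """Prefer saved_path, fall back to img_path, warn if both are missing."""
    path = element.get("saved_path") or element.get("img_path")
    if not path:
        logger.warning(
            "Visual element %r has neither saved_path nor img_path",
            element.get("type"),
        )
        return None
    return path


def build_visual_content(element: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a content dictionary (hypothetical shape) for a visual element.

    Returns None when the element is not one of the visual types.
    """
    if str(element.get("type", "")).upper() not in VISUAL_TYPES:
        return None
    return {
        "path": resolve_image_path(element),
        "width": element.get("width"),
        "height": element.get("height"),
        "format": element.get("format"),
    }
```

Keeping path resolution in one helper means CHART and the other visual types share the same fallback and warning behavior.
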
#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field

#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in the content dictionary
- **AND** enable downstream PDF generation to embed these images

#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fall back to img_path if saved_path is missing
- **AND** log a warning if both paths are missing

### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.

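
As a non-normative illustration, the bounds check and the clamping that this requirement's scenarios call for might look like the sketch below. It assumes each cell box is an `(x0, y0, x1, y1)` tuple in page coordinates; that layout and both function names are assumptions, not part of the PP-StructureV3 output contract.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def is_within_page(box: Box, page_width: float, page_height: float) -> bool:
    """Return True when the box lies inside (0, 0, page_width, page_height)."""
    x0, y0, x1, y1 = box
    return 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height


def clamp_to_bbox(box: Box, table_bbox: Box) -> Box:
    """Clamp each coordinate of a cell box to the table bounding box."""
    x0, y0, x1, y1 = box
    tx0, ty0, tx1, ty1 = table_bbox
    return (
        min(max(x0, tx0), tx1),
        min(max(y0, ty0), ty1),
        min(max(x1, tx0), tx1),
        min(max(y1, ty0), ty1),
    )
```

A box that fails `is_within_page` would be marked for fallback processing, while a box that passes but pokes slightly outside the table would go through `clamp_to_bbox`.
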
#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing

#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as fallback
- **AND** reconstruct table structure from detected lines
- **AND** include fallback_used flag in table metadata

#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization

### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.

#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width x height)
- **AND** filter out images with area < 200 square pixels
- **AND** log filtered image count for debugging

#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering

### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.

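
As a non-normative illustration, the IoU comparison that this requirement's detection scenario relies on could be sketched as below. Boxes are assumed to be `(x0, y0, x1, y1)` tuples and both function names are hypothetical. The sketch covers only the geometric overlap test; checking the fill color (black, white, or low contrast) is out of its scope.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_covering_images(
    image_boxes: Sequence[Box],
    content_boxes: Sequence[Box],
    threshold: float = 0.8,
) -> List[int]:
    """Return indices of images whose IoU with any content box exceeds threshold."""
    return [
        i
        for i, img in enumerate(image_boxes)
        if any(iou(img, content) > threshold for content in content_boxes)
    ]
```

The returned indices can then be used to mark images for exclusion and to populate a `covering_images_removed` count.
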
#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion

#### Scenario: Exclude covering images from rendering
- **WHEN** generating output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include covering_images_removed count in metadata

#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect both black fill (redaction style)
- **AND** white fill (whiteout style)
- **AND** low-contrast rectangles intended to hide content

## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation

#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF

## ADDED Requirements

### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

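
As a non-normative illustration of cell merging, one way to derive rowspan and colspan is to snap cell bounding boxes onto the table's grid lines. The sketch below assumes the cells arrive as `(x0, y0, x1, y1)` tuples (for example, the bounding boxes reported for a table found by `find_tables()`); the function names, the tolerance value, and the output dictionary shape are hypothetical. Because only real cells are passed in, no placeholder cells are emitted for positions covered by a merge.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def _grid_lines(values: Sequence[float], tol: float) -> List[float]:
    """Collapse nearly equal coordinates into a sorted list of grid lines."""
    lines: List[float] = []
    for v in sorted(values):
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def _nearest(lines: Sequence[float], v: float) -> int:
    """Index of the grid line closest to v."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - v))


def infer_spans(cells: Sequence[BBox], tol: float = 1.0) -> List[Dict[str, object]]:
    """Assign row, col, rowspan, and colspan to each cell from its bbox."""
    xs = _grid_lines([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = _grid_lines([c[1] for c in cells] + [c[3] for c in cells], tol)
    result = []
    for x0, y0, x1, y1 in cells:
        row, col = _nearest(ys, y0), _nearest(xs, x0)
        result.append(
            {
                "row": row,
                "col": col,
                # A cell spanning several grid intervals is a merged cell.
                "rowspan": max(1, _nearest(ys, y1) - row),
                "colspan": max(1, _nearest(xs, x1) - col),
                "bbox": (x0, y0, x1, y1),
            }
        )
    return result
```

For a header cell that stretches across two columns above two regular cells, `infer_spans` returns three entries, with `colspan` equal to 2 on the header, so the output cell count matches the visual cell count.
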
#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges

#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images

#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output
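
As a non-normative illustration, the minimum-area filtering referenced in the scenarios above could be sketched as follows, using the 200 square pixel default and the "0 disables filtering" rule from the Decoration Image Filtering requirement. Images are assumed to be dictionaries with `width` and `height` keys; that shape and the function name are hypothetical.

```python
import logging
from typing import Any, Dict, List, Sequence, Tuple

logger = logging.getLogger(__name__)

DEFAULT_MIN_IMAGE_AREA = 200  # square pixels, default from this specification


def filter_decoration_images(
    images: Sequence[Dict[str, Any]],
    min_area: float = DEFAULT_MIN_IMAGE_AREA,
) -> Tuple[List[Dict[str, Any]], int]:
    """Drop images whose area is below min_area; return (kept, removed_count).

    A threshold of 0 (or less) disables filtering entirely.
    """
    if min_area <= 0:
        return list(images), 0
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    removed = len(images) - len(kept)
    logger.debug("Filtered %d decoration image(s) below %s px^2", removed, min_area)
    return kept, removed
```

Returning the removed count alongside the kept images makes it straightforward to log the number of filtered images for debugging.
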