document-processing Specification Delta
ADDED Requirements
Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
Scenario: Detect merged cells in Direct Track
- WHEN extracting tables from an editable PDF using Direct Track
- THEN the system SHALL use the PyMuPDF find_tables() API
- AND correctly identify cells with rowspan > 1 or colspan > 1
- AND preserve merge information in UnifiedDocument table structure
- AND skip placeholder cells that are covered by merged cells
Scenario: Handle complex table structures
- WHEN processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- THEN the system SHALL NOT split merged cells into individual cells
- AND the output cell count SHALL match the actual visual cell count
- AND the rendered PDF SHALL display correct merged cell boundaries
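As an illustration of the merge detection described above, here is a minimal sketch that derives rowspan/colspan from cell bounding boxes alone. It assumes each cell is an (x0, y0, x1, y1) tuple, similar in shape to the cell boxes a PyMuPDF table exposes; the function names and the tolerance value are hypothetical, not part of this specification.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _edges(values: List[float], tol: float) -> List[float]:
    """Collapse nearly equal coordinates into one sorted list of grid lines."""
    merged: List[float] = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged


def _index(edges: List[float], value: float) -> int:
    """Return the index of the grid line closest to value."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def cell_spans(cells: List[BBox], tol: float = 1.0) -> List[Dict[str, int]]:
    """Compute row, col, rowspan and colspan for each visual cell.

    A cell that crosses more than one grid interval is a merged cell.
    Only real cells are passed in, so placeholder cells never appear.
    """
    xs = _edges([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = _edges([c[1] for c in cells] + [c[3] for c in cells], tol)
    result = []
    for x0, y0, x1, y1 in cells:
        row, col = _index(ys, y0), _index(xs, x0)
        result.append({
            "row": row,
            "col": col,
            "rowspan": _index(ys, y1) - row,
            "colspan": _index(xs, x1) - col,
        })
    return result


# A 2x2 grid whose top row is one merged cell: three visual cells in total.
spans = cell_spans([(0, 0, 20, 10), (0, 10, 10, 20), (10, 10, 20, 20)])
```

Because the output has one entry per input box, the cell count matches the visual cell count by construction.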
Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.
Scenario: Preserve CHART element paths
- WHEN converting PP-StructureV3 output containing CHART elements
- THEN the system SHALL treat CHART as a visual element type
- AND extract saved_path from the element data
- AND include saved_path in the UnifiedDocument content field
Scenario: Support all visual element types
- WHEN processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- THEN the system SHALL extract saved_path or img_path for each element
- AND preserve path, width, height, and format in content dictionary
- AND enable downstream PDF generation to embed these images
Scenario: Fallback path resolution
- WHEN a visual element has multiple path fields (saved_path, img_path)
- THEN the system SHALL prefer saved_path over img_path
- AND fall back to img_path if saved_path is missing
- AND log a warning if both paths are missing
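The path-resolution rules above could be sketched as follows. The element is assumed to be a plain dictionary; the keys "type", "width", "height", and "format" and both function names are hypothetical, chosen only for illustration.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)

# Visual element types named in the scenarios above.
VISUAL_TYPES = {"IMAGE", "FIGURE", "CHART", "DIAGRAM", "LOGO", "STAMP"}


def resolve_image_path(element: Dict[str, Any]) -> Optional[str]:
    """Prefer saved_path, fall back to img_path, warn when both are missing."""
    path = element.get("saved_path") or element.get("img_path")
    if not path:
        logger.warning(
            "Visual element %r has neither saved_path nor img_path",
            element.get("type"),
        )
        return None
    return path


def visual_content(element: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build the content dictionary for a visual element, or None."""
    if str(element.get("type", "")).upper() not in VISUAL_TYPES:
        return None
    path = resolve_image_path(element)
    if path is None:
        return None
    return {
        "path": path,
        "width": element.get("width"),
        "height": element.get("height"),
        "format": element.get("format"),
    }
```

For example, an element carrying both fields resolves to its saved_path, while one carrying only img_path resolves to that value.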
Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
Scenario: Detect out-of-bounds coordinates
- WHEN processing cell_boxes from PP-StructureV3
- THEN the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- AND log tables with coordinates exceeding page bounds
- AND mark affected cells for fallback processing
Scenario: Apply CV line detection fallback
- WHEN cell_boxes coordinates are invalid (out of bounds)
- THEN the system SHALL apply OpenCV line detection as fallback
- AND reconstruct table structure from detected lines
- AND include fallback_used flag in table metadata
Scenario: Coordinate normalization
- WHEN coordinates are within page bounds but slightly outside table bbox
- THEN the system SHALL clamp coordinates to table boundaries
- AND preserve relative cell positions
- AND ensure no cells overlap after normalization
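A minimal sketch of the validation and clamping steps above, assuming boxes are (x0, y0, x1, y1) tuples in page coordinates. The function names and the returned "fallback_used" shape are illustrative assumptions; the OpenCV line-detection fallback itself is not shown.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def within_page(box: BBox, page_width: float, page_height: float) -> bool:
    """True when the box is well formed and lies inside the page."""
    x0, y0, x1, y1 = box
    return 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height


def clamp_to_table(box: BBox, table: BBox) -> BBox:
    """Clamp each coordinate of box into the table bounding box."""
    x0, y0, x1, y1 = box
    tx0, ty0, tx1, ty1 = table
    return (
        min(max(x0, tx0), tx1),
        min(max(y0, ty0), ty1),
        min(max(x1, tx0), tx1),
        min(max(y1, ty0), ty1),
    )


def normalize_cells(
    cell_boxes: List[BBox], table: BBox, page_width: float, page_height: float
) -> Dict[str, object]:
    """Flag the table for fallback if any box leaves the page, else clamp."""
    if not all(within_page(b, page_width, page_height) for b in cell_boxes):
        return {"fallback_used": True, "cells": []}
    return {
        "fallback_used": False,
        "cells": [clamp_to_table(b, table) for b in cell_boxes],
    }
```

Clamping only shrinks a box toward the table, so cells that did not overlap before clamping do not overlap afterwards, and their relative order is unchanged.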
Requirement: Decoration Image Filtering
The system SHALL filter out small decorative images that do not contribute meaningful content.
Scenario: Filter tiny images by area
- WHEN extracting images from a document
- THEN the system SHALL calculate image area (width x height)
- AND filter out images with area < 200 square pixels
- AND log filtered image count for debugging
Scenario: Configurable filtering threshold
- WHEN processing documents with intentionally small images
- THEN the system SHALL support configuration of minimum image area threshold
- AND default to 200 square pixels if not specified
- AND allow threshold = 0 to disable filtering
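The area filter and its configurable threshold might look like the sketch below. Images are assumed to be dictionaries with "width" and "height" in pixels; the function name and the DEFAULT_MIN_AREA constant are hypothetical.

```python
import logging
from typing import Any, Dict, List, Tuple

logger = logging.getLogger(__name__)

DEFAULT_MIN_AREA = 200  # square pixels, the default stated above


def filter_decoration_images(
    images: List[Dict[str, Any]], min_area: float = DEFAULT_MIN_AREA
) -> Tuple[List[Dict[str, Any]], int]:
    """Drop images whose area is below min_area; return (kept, removed count).

    With min_area = 0 no area can be smaller than the threshold,
    so filtering is effectively disabled.
    """
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    removed = len(images) - len(kept)
    logger.debug("Filtered %d decoration image(s) below %s px^2", removed, min_area)
    return kept, removed
```

An image of exactly 200 square pixels is kept, since the rule filters only areas strictly below the threshold.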
Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.
Scenario: Detect covering rectangles
- WHEN preprocessing a PDF page
- THEN the system SHALL detect black/white rectangles covering text regions
- AND identify covering images by high IoU (> 0.8) with underlying content
- AND mark covering images for exclusion
Scenario: Exclude covering images from rendering
- WHEN generating output PDF
- THEN the system SHALL exclude images marked as covering
- AND preserve the text content that was covered
- AND include covering_images_removed count in metadata
Scenario: Handle both black and white covering
- WHEN detecting covering rectangles
- THEN the system SHALL detect black fill (redaction style)
- AND white fill (whiteout style)
- AND low-contrast rectangles intended to hide content
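The IoU test used to identify covering images can be sketched as follows, assuming rectangles and content regions are (x0, y0, x1, y1) tuples. The function names are hypothetical, and checking the fill color (black, white, or low contrast) is left out of this sketch.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def find_covering(
    rects: List[BBox], content_boxes: List[BBox], threshold: float = 0.8
) -> List[int]:
    """Return indices of rectangles whose IoU with any content box exceeds threshold."""
    return [
        i
        for i, rect in enumerate(rects)
        if any(iou(rect, box) > threshold for box in content_boxes)
    ]
```

The returned indices can be used both to exclude those images from rendering and to populate a covering_images_removed count.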
MODIFIED Requirements
Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
Scenario: Extract comprehensive document structure
- WHEN processing through OCR track
- THEN the system SHALL use page_result.json['parsing_res_list']
- AND extract all element types including headers, lists, tables, figures
- AND preserve layout_bbox coordinates for each element
Scenario: Maintain reading order
- WHEN extracting elements from PP-StructureV3
- THEN the system SHALL preserve the reading order from parsing_res_list
- AND assign sequential indices to elements
- AND support reordering for complex layouts
Scenario: Extract table structure
- WHEN PP-StructureV3 identifies a table
- THEN the system SHALL extract cell content and boundaries
- AND validate cell_boxes coordinates against page boundaries
- AND apply fallback detection for invalid coordinates
- AND preserve table HTML for structure
- AND extract plain text for translation
Scenario: Extract visual elements with paths
- WHEN PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- THEN the system SHALL preserve saved_path for each element
- AND include image dimensions and format
- AND enable image embedding in output PDF
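To illustrate the reading-order scenario, the sketch below walks a parsing_res_list in its original order and assigns sequential indices. The entry keys "type" and "layout_bbox" are assumptions made for this example and may not match the actual PP-StructureV3 output; the function name is hypothetical.

```python
from typing import Any, Dict, List


def index_elements(parsing_res_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy each element in list order and attach a sequential reading index."""
    elements = []
    for index, item in enumerate(parsing_res_list):
        elements.append({
            "index": index,  # reading order as given by parsing_res_list
            "type": item.get("type"),  # assumed key
            "layout_bbox": item.get("layout_bbox"),  # assumed key
        })
    return elements


ordered = index_elements([
    {"type": "header", "layout_bbox": [0, 0, 100, 20]},
    {"type": "table", "layout_bbox": [0, 30, 100, 80]},
])
```

Keeping the index as an explicit field means a later reordering pass for complex layouts can sort elements without losing the original order.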
ADDED Requirements
Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
Scenario: Extract tables with cell merging
- WHEN direct extraction encounters a table
- THEN the system SHALL use the PyMuPDF find_tables() API
- AND extract cell content with correct rowspan/colspan
- AND preserve merged cell boundaries
- AND skip placeholder cells covered by merges
Scenario: Filter decoration images
- WHEN extracting images from PDF
- THEN the system SHALL filter images smaller than minimum area threshold
- AND exclude covering/redaction images
- AND preserve meaningful content images
Scenario: Preserve text styling with image handling
- WHEN direct extraction completes
- THEN the system SHALL convert PyMuPDF results to UnifiedDocument
- AND preserve text styling, fonts, and exact positioning
- AND extract tables with cell boundaries, content, and merge info
- AND include only meaningful images in output