OCR/openspec/changes/refactor-dual-track-architecture/specs/document-processing/spec.md
2025-12-04 18:00:37 +08:00


# document-processing Specification Delta
## ADDED Requirements
### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells
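
Non-normative sketch: one way to derive rowspan/colspan is to snap each cell rectangle to the table's grid lines and count how many grid intervals it crosses. It assumes the extractor yields one `(x0, y0, x1, y1)` rectangle per visual cell; the function names and the output dictionary layout are hypothetical and are not part of UnifiedDocument.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _grid_lines(values: Sequence[float], tol: float = 1.0) -> List[float]:
    """Collapse nearly equal coordinates into a sorted list of grid lines."""
    lines: List[float] = []
    for v in sorted(values):
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def _nearest(lines: Sequence[float], v: float) -> int:
    """Index of the grid line closest to coordinate v."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - v))


def cells_with_spans(cell_bboxes: Sequence[BBox]) -> List[Dict]:
    """Return one record per visual cell, with row/col origin and spans.

    Because the input holds only real cells, no placeholder entries are
    produced for grid positions covered by a merged cell.
    """
    xs = _grid_lines([x for b in cell_bboxes for x in (b[0], b[2])])
    ys = _grid_lines([y for b in cell_bboxes for y in (b[1], b[3])])
    cells = []
    for x0, y0, x1, y1 in cell_bboxes:
        row, col = _nearest(ys, y0), _nearest(xs, x0)
        cells.append({
            "row": row,
            "col": col,
            "rowspan": _nearest(ys, y1) - row,
            "colspan": _nearest(xs, x1) - col,
            "bbox": (x0, y0, x1, y1),
        })
    return sorted(cells, key=lambda c: (c["row"], c["col"]))
```

For a table whose first row is a single cell spanning two columns, the sketch returns three cells rather than four, and the first one carries `colspan == 2`.
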
#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries
### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.
#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field
#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in content dictionary
- **AND** enable downstream PDF generation to embed these images
#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fall back to img_path if saved_path is missing
- **AND** log a warning if both paths are missing
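
Non-normative sketch of the fallback order described above. It assumes a visual element arrives as a plain dictionary; the function name and the `type` key used in the log message are hypothetical.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def resolve_image_path(element: Dict[str, Any]) -> Optional[str]:
    """Prefer saved_path, fall back to img_path, warn if neither is set."""
    path = element.get("saved_path") or element.get("img_path")
    if not path:
        logger.warning(
            "Visual element of type %r has neither saved_path nor img_path",
            element.get("type"),  # hypothetical key, used only for the message
        )
        return None
    return path
```

Empty strings are treated the same as missing values, so an element with `saved_path == ""` still falls back to `img_path`.
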
### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing
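
Non-normative sketch of the bounds check. It assumes `cell_boxes` is a list of `(x0, y0, x1, y1)` tuples in page coordinates; the function name and the optional tolerance parameter are hypothetical.

```python
import logging
from typing import List, Sequence, Tuple

logger = logging.getLogger(__name__)

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def out_of_bounds_cells(
    cell_boxes: Sequence[BBox],
    page_width: float,
    page_height: float,
    tol: float = 0.0,
) -> List[int]:
    """Return indices of cells whose box leaves (0, 0, page_width, page_height)."""
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(cell_boxes):
        inside = (
            -tol <= x0 <= x1 <= page_width + tol
            and -tol <= y0 <= y1 <= page_height + tol
        )
        if not inside:
            bad.append(i)
    if bad:
        logger.warning("%d cell box(es) exceed page bounds: %s", len(bad), bad)
    return bad
```

The returned indices can be used to mark the affected cells for fallback processing.
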
#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as fallback
- **AND** reconstruct table structure from detected lines
- **AND** include fallback_used flag in table metadata
#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization
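
Non-normative sketch of the clamping step. Clamping each coordinate independently is a monotone mapping, so the left-to-right and top-to-bottom order of cells is kept; the overlap helper is a separate check that can be run afterwards. All names here are hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def clamp_box(box: BBox, table_bbox: BBox) -> BBox:
    """Clamp a cell box so that it lies inside the table bounding box."""
    x0, y0, x1, y1 = box
    tx0, ty0, tx1, ty1 = table_bbox
    return (
        min(max(x0, tx0), tx1),
        min(max(y0, ty0), ty1),
        min(max(x1, tx0), tx1),
        min(max(y1, ty0), ty1),
    )


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True if the two boxes share an area greater than zero."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])
```

Cells that only share an edge after clamping are not reported as overlapping, because their intersection has zero area.
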
### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.
#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width × height)
- **AND** filter out images with area < 200 square pixels
- **AND** log the filtered image count for debugging
#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering
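
Non-normative sketch of the area filter with a configurable threshold. It assumes each image is a dictionary with `width` and `height` in pixels; the function and constant names are hypothetical.

```python
import logging
from typing import Any, Dict, List, Sequence, Tuple

logger = logging.getLogger(__name__)

DEFAULT_MIN_IMAGE_AREA = 200  # square pixels, per the default in this spec


def filter_decoration_images(
    images: Sequence[Dict[str, Any]],
    min_area: float = DEFAULT_MIN_IMAGE_AREA,
) -> Tuple[List[Dict[str, Any]], int]:
    """Drop images whose area is below min_area; return (kept, filtered_count)."""
    if min_area <= 0:
        # A threshold of 0 disables filtering entirely.
        return list(images), 0
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    filtered = len(images) - len(kept)
    logger.debug("Filtered %d decoration image(s) below %s px^2", filtered, min_area)
    return kept, filtered
```

An image whose area is exactly 200 square pixels is kept, since the requirement filters only areas strictly below the threshold.
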
### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.
#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion
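
Non-normative sketch of the IoU test used to flag covering images. It covers only the geometric comparison; detecting the fill colour of a rectangle is out of scope here. Boxes are assumed to be `(x0, y0, x1, y1)` tuples and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def covering_image_indices(
    image_boxes: Sequence[BBox],
    content_boxes: Sequence[BBox],
    threshold: float = 0.8,
) -> List[int]:
    """Indices of images whose IoU with any content box exceeds the threshold."""
    return [
        i
        for i, img in enumerate(image_boxes)
        if any(iou(img, c) > threshold for c in content_boxes)
    ]
```

Images returned by `covering_image_indices` would be marked for exclusion, while the underlying content boxes are left untouched.
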
#### Scenario: Exclude covering images from rendering
- **WHEN** generating output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include covering_images_removed count in metadata
#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect black-fill rectangles (redaction style)
- **AND** white-fill rectangles (whiteout style)
- **AND** low-contrast rectangles intended to hide content
## MODIFIED Requirements
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
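
Non-normative sketch of index assignment and one possible reordering strategy. It assumes each entry of parsing_res_list is a dictionary carrying a `layout_bbox` of the form `[x0, y0, x1, y1]`; the `index` key, the function names, and the top-to-bottom sort are hypothetical illustrations, not the behaviour of PP-StructureV3.

```python
from typing import Any, Dict, List, Sequence


def assign_reading_order(elements: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Copy each element and attach a sequential index in the given order."""
    return [{**el, "index": i} for i, el in enumerate(elements)]


def reorder_top_to_bottom(elements: Sequence[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Re-sort by (top, left) of layout_bbox, then renumber.

    This is only one simple strategy; multi-column layouts would need a
    column-aware ordering instead.
    """
    ordered = sorted(
        elements, key=lambda el: (el["layout_bbox"][1], el["layout_bbox"][0])
    )
    return assign_reading_order(ordered)
```

By default `assign_reading_order` keeps the order delivered in parsing_res_list, and the reordering helper is applied only when a layout calls for it.
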
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges
#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images
#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output