test

2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions
--- a/openspec/changes/refactor-dual-track-architecture/specs/document-processing/spec.md
+++ b/openspec/changes/refactor-dual-track-architecture/specs/document-processing/spec.md
@@ -0,0 +1,151 @@
+# document-processing Specification Delta
+
+## ADDED Requirements
+
+### Requirement: Table Cell Merging Detection
+The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
+
+#### Scenario: Detect merged cells in Direct Track
+- **WHEN** extracting tables from an editable PDF using Direct Track
+- **THEN** the system SHALL use the PyMuPDF find_tables() API
+- **AND** correctly identify cells with rowspan > 1 or colspan > 1
+- **AND** preserve merge information in UnifiedDocument table structure
+- **AND** skip placeholder cells that are covered by merged cells
+
+#### Scenario: Handle complex table structures
+- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
+- **THEN** the system SHALL NOT split merged cells into individual cells
+- **AND** the output cell count SHALL match the actual visual cell count
+- **AND** the rendered PDF SHALL display correct merged cell boundaries
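
One way to picture the merge detection above is as a minimal sketch that infers rowspan/colspan from cell rectangles alone. It assumes the cell bounding boxes are already available as `(x0, y0, x1, y1)` tuples (for example from a table finder's cell list); `infer_spans` and its helpers are hypothetical names, not part of PyMuPDF or of this change set.

```python
from typing import Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _cluster_edges(values: Sequence[float], tol: float) -> List[float]:
    """Collapse coordinates that lie within `tol` of each other into one grid line."""
    edges: List[float] = []
    for v in sorted(values):
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def _nearest_index(edges: Sequence[float], v: float) -> int:
    """Index of the grid line closest to coordinate `v`."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - v))


def infer_spans(cells: Sequence[BBox], tol: float = 1.0) -> List[Dict[str, int]]:
    """Return row/col/rowspan/colspan for each real cell.

    Only real cells are passed in, so placeholder positions covered by a
    merged cell never appear in the output.
    """
    xs = _cluster_edges([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = _cluster_edges([c[1] for c in cells] + [c[3] for c in cells], tol)
    result = []
    for x0, y0, x1, y1 in cells:
        col = _nearest_index(xs, x0)
        row = _nearest_index(ys, y0)
        result.append({
            "row": row,
            "col": col,
            "rowspan": max(1, _nearest_index(ys, y1) - row),
            "colspan": max(1, _nearest_index(xs, x1) - col),
        })
    return result
```

For a table whose top row is one cell spanning two columns, `infer_spans([(0, 0, 200, 50), (0, 50, 100, 100), (100, 50, 200, 100)])` reports `colspan` 2 for the first cell and emits three cells in total, matching the visual cell count.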
+
+### Requirement: Visual Element Path Preservation
+The system SHALL preserve image paths for all visual element types during OCR conversion.
+
+#### Scenario: Preserve CHART element paths
+- **WHEN** converting PP-StructureV3 output containing CHART elements
+- **THEN** the system SHALL treat CHART as a visual element type
+- **AND** extract saved_path from the element data
+- **AND** include saved_path in the UnifiedDocument content field
+
+#### Scenario: Support all visual element types
+- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
+- **THEN** the system SHALL extract saved_path or img_path for each element
+- **AND** preserve path, width, height, and format in content dictionary
+- **AND** enable downstream PDF generation to embed these images
+
+#### Scenario: Fallback path resolution
+- **WHEN** a visual element has multiple path fields (saved_path, img_path)
+- **THEN** the system SHALL prefer saved_path over img_path
+- **AND** fall back to img_path if saved_path is missing
+- **AND** log a warning if both paths are missing
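
The fallback order in the scenario above could look roughly like the following sketch. It assumes each visual element arrives as a plain dictionary carrying optional `saved_path` and `img_path` keys; `resolve_image_path` is a hypothetical helper name, and treating an empty string as "missing" is an assumption of this example.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger(__name__)


def resolve_image_path(element: Mapping[str, Any]) -> Optional[str]:
    """Prefer saved_path, fall back to img_path, warn if neither is set."""
    # Empty strings are treated the same as a missing key (assumption).
    path = element.get("saved_path") or element.get("img_path")
    if not path:
        logger.warning(
            "visual element of type %r has no saved_path or img_path",
            element.get("type"),
        )
        return None
    return path
```

Returning `None` instead of raising leaves the caller free to skip the element while the rest of the page is still converted.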
+
+### Requirement: Cell Box Coordinate Validation
+The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
+
+#### Scenario: Detect out-of-bounds coordinates
+- **WHEN** processing cell_boxes from PP-StructureV3
+- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
+- **AND** log tables with coordinates exceeding page bounds
+- **AND** mark affected cells for fallback processing
+
+#### Scenario: Apply CV line detection fallback
+- **WHEN** cell_boxes coordinates are invalid (out of bounds)
+- **THEN** the system SHALL apply OpenCV line detection as fallback
+- **AND** reconstruct table structure from detected lines
+- **AND** include a fallback_used flag in table metadata
+
+#### Scenario: Coordinate normalization
+- **WHEN** coordinates are within page bounds but slightly outside table bbox
+- **THEN** the system SHALL clamp coordinates to table boundaries
+- **AND** preserve relative cell positions
+- **AND** ensure no cells overlap after normalization
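
The bounds check and the clamping step described above can be sketched as two small functions. This is an illustration only: it assumes boxes are `(x0, y0, x1, y1)` tuples in page coordinates, the function names are hypothetical, and the OpenCV line-detection fallback and the overlap check are deliberately left out.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def is_within_page(box: Box, page_width: float, page_height: float) -> bool:
    """True if the box lies entirely inside (0, 0, page_width, page_height)."""
    x0, y0, x1, y1 = box
    return 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height


def clamp_to_bbox(box: Box, bounds: Box) -> Box:
    """Clamp each coordinate of `box` into `bounds` (e.g. the table bbox)."""
    x0, y0, x1, y1 = box
    bx0, by0, bx1, by1 = bounds
    return (
        min(max(x0, bx0), bx1),
        min(max(y0, by0), by1),
        min(max(x1, bx0), bx1),
        min(max(y1, by0), by1),
    )
```

A caller would route cells that fail `is_within_page` to the fallback path and apply `clamp_to_bbox` only to cells that are on the page but spill slightly past the table bbox. Clamping alone does not guarantee that cells stay non-overlapping, so that check would need to be a separate step.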
+
+### Requirement: Decoration Image Filtering
+The system SHALL filter out tiny decorative images that do not contribute meaningful content.
+
+#### Scenario: Filter tiny images by area
+- **WHEN** extracting images from a document
+- **THEN** the system SHALL calculate image area (width x height)
+- **AND** filter out images with area < 200 square pixels
+- **AND** log filtered image count for debugging
+
+#### Scenario: Configurable filtering threshold
+- **WHEN** processing documents with intentionally small images
+- **THEN** the system SHALL support configuration of minimum image area threshold
+- **AND** default to 200 square pixels if not specified
+- **AND** allow threshold = 0 to disable filtering
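
As a rough sketch of the area filter and its configurable threshold, assuming each image is represented as a dictionary with `width` and `height` in pixels (the function and constant names are hypothetical):

```python
from typing import Any, Dict, List, Sequence, Tuple

DEFAULT_MIN_IMAGE_AREA = 200  # square pixels, the default named in the spec


def filter_decoration_images(
    images: Sequence[Dict[str, Any]],
    min_area: float = DEFAULT_MIN_IMAGE_AREA,
) -> Tuple[List[Dict[str, Any]], int]:
    """Return (kept_images, filtered_count).

    Images whose area is strictly below `min_area` are dropped.
    A threshold of 0 disables filtering.
    """
    if min_area <= 0:
        return list(images), 0
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    return kept, len(images) - len(kept)
```

The returned count is what a caller would log for debugging. An image of exactly 200 square pixels is kept, because the scenario filters only areas strictly below the threshold.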
+
+### Requirement: Covering Image Removal
+The system SHALL remove covering/redaction images from the final output.
+
+#### Scenario: Detect covering rectangles
+- **WHEN** preprocessing a PDF page
+- **THEN** the system SHALL detect black/white rectangles covering text regions
+- **AND** identify covering images by high IoU (> 0.8) with underlying content
+- **AND** mark covering images for exclusion
+
+#### Scenario: Exclude covering images from rendering
+- **WHEN** generating output PDF
+- **THEN** the system SHALL exclude images marked as covering
+- **AND** preserve the text content that was covered
+- **AND** include covering_images_removed count in metadata
+
+#### Scenario: Handle both black and white covering
+- **WHEN** detecting covering rectangles
+- **THEN** the system SHALL detect black-fill rectangles (redaction style)
+- **AND** white-fill rectangles (whiteout style)
+- **AND** low-contrast rectangles intended to hide content
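
The IoU test used to identify covering images can be illustrated with a short sketch. It assumes axis-aligned `(x0, y0, x1, y1)` boxes for both the candidate image and the underlying content region; `iou` and `is_covering` are hypothetical names, and the fill-colour check (black, white, low contrast) is not shown.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

COVERING_IOU_THRESHOLD = 0.8  # threshold named in the spec


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covering(image_box: Box, content_box: Box) -> bool:
    """True if the image overlaps the content region with IoU above the threshold."""
    return iou(image_box, content_box) > COVERING_IOU_THRESHOLD
```

Two identical boxes give an IoU of 1.0 and are flagged as covering, while two equal-sized boxes that overlap by half give an IoU of 1/3 and are not.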
+
+## MODIFIED Requirements
+
+### Requirement: Enhanced OCR with Full PP-StructureV3
+The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
+
+#### Scenario: Extract comprehensive document structure
+- **WHEN** processing through OCR track
+- **THEN** the system SHALL use page_result.json['parsing_res_list']
+- **AND** extract all element types including headers, lists, tables, figures
+- **AND** preserve layout_bbox coordinates for each element
+
+#### Scenario: Maintain reading order
+- **WHEN** extracting elements from PP-StructureV3
+- **THEN** the system SHALL preserve the reading order from parsing_res_list
+- **AND** assign sequential indices to elements
+- **AND** support reordering for complex layouts
+
+#### Scenario: Extract table structure
+- **WHEN** PP-StructureV3 identifies a table
+- **THEN** the system SHALL extract cell content and boundaries
+- **AND** validate cell_boxes coordinates against page boundaries
+- **AND** apply fallback detection for invalid coordinates
+- **AND** preserve table HTML for structure
+- **AND** extract plain text for translation
+
+#### Scenario: Extract visual elements with paths
+- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
+- **THEN** the system SHALL preserve saved_path for each element
+- **AND** include image dimensions and format
+- **AND** enable image embedding in output PDF
+
+### Requirement: Generate UnifiedDocument from direct extraction
+The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
+
+#### Scenario: Extract tables with cell merging
+- **WHEN** direct extraction encounters a table
+- **THEN** the system SHALL use the PyMuPDF find_tables() API
+- **AND** extract cell content with correct rowspan/colspan
+- **AND** preserve merged cell boundaries
+- **AND** skip placeholder cells covered by merges
+
+#### Scenario: Filter decoration images
+- **WHEN** extracting images from PDF
+- **THEN** the system SHALL filter images smaller than minimum area threshold
+- **AND** exclude covering/redaction images
+- **AND** preserve meaningful content images
+
+#### Scenario: Preserve text styling with image handling
+- **WHEN** direct extraction completes
+- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
+- **AND** preserve text styling, fonts, and exact positioning
+- **AND** extract tables with cell boundaries, content, and merge info
+- **AND** include only meaningful images in output