test
@@ -0,0 +1,151 @@
# document-processing Specification Delta

## ADDED Requirements

### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.

#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in the UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells

#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries
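
One way to derive rowspan/colspan from cell rectangles can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each visual cell is available as an `(x0, y0, x1, y1)` tuple, and the helper names `grid_lines`, `nearest`, and `cell_spans` are hypothetical. Because only real cell rectangles are passed in, no placeholder cells are produced for positions covered by a merge.

```python
def grid_lines(values, tol=1.0):
    """Merge nearly equal coordinates into a sorted list of grid lines."""
    lines = []
    for v in sorted(values):
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def nearest(lines, value):
    """Return the index of the grid line closest to value."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - value))


def cell_spans(bboxes, tol=1.0):
    """Compute row, col, rowspan and colspan for each (x0, y0, x1, y1) box.

    Hypothetical helper: a cell whose box crosses more than one grid
    interval is treated as a merged cell.
    """
    xs = grid_lines([b[0] for b in bboxes] + [b[2] for b in bboxes], tol)
    ys = grid_lines([b[1] for b in bboxes] + [b[3] for b in bboxes], tol)
    cells = []
    for x0, y0, x1, y1 in bboxes:
        row, col = nearest(ys, y0), nearest(xs, x0)
        cells.append({
            "row": row,
            "col": col,
            "rowspan": nearest(ys, y1) - row,
            "colspan": nearest(xs, x1) - col,
        })
    return cells


# A header cell spanning two columns, followed by two regular cells.
boxes = [(0, 0, 200, 20), (0, 20, 100, 40), (100, 20, 200, 40)]
spans = cell_spans(boxes)
```

Here the first box covers both column intervals, so it is reported with `colspan` 2, and the output cell count stays equal to the visual cell count.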

### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.

#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field

#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in the content dictionary
- **AND** enable downstream PDF generation to embed these images

#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fall back to img_path if saved_path is missing
- **AND** log a warning if both paths are missing
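
The path preference order could look roughly like this. It is a sketch only: it assumes each element is a plain dict carrying optional `saved_path` and `img_path` keys, and the function name `resolve_image_path` and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("visual_elements")  # hypothetical logger name


def resolve_image_path(element):
    """Return saved_path if present, else img_path, else None with a warning."""
    for key in ("saved_path", "img_path"):
        path = element.get(key)
        if path:  # skip missing keys as well as empty strings
            return path
    logger.warning(
        "Visual element of type %r has neither saved_path nor img_path",
        element.get("type"),
    )
    return None


preferred = resolve_image_path({"type": "CHART", "saved_path": "a.png", "img_path": "b.png"})
fallback = resolve_image_path({"type": "FIGURE", "img_path": "b.png"})
missing = resolve_image_path({"type": "STAMP"})
```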

### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.

#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing
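
A bounds check along these lines might be written as below. The sketch assumes `cell_boxes` is a list of `(x0, y0, x1, y1)` tuples in page coordinates; the function name `find_out_of_bounds` and the `table_id` parameter are hypothetical and used only for the log message.

```python
import logging

logger = logging.getLogger("cell_box_validation")  # hypothetical logger name


def find_out_of_bounds(cell_boxes, page_width, page_height, table_id=None):
    """Return indices of cell boxes that are not fully inside the page.

    Boxes with inverted edges (x0 > x1 or y0 > y1) are also reported,
    since they cannot describe a valid cell.
    """
    bad = []
    for i, (x0, y0, x1, y1) in enumerate(cell_boxes):
        inside = 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height
        if not inside:
            bad.append(i)
    if bad:
        logger.warning(
            "Table %r has %d cell box(es) outside page bounds", table_id, len(bad)
        )
    return bad


boxes = [(10, 10, 100, 50), (500, 10, 700, 50), (-5, 60, 100, 90)]
invalid = find_out_of_bounds(boxes, page_width=612, page_height=792, table_id="t1")
```

The returned indices can then be used to mark the affected cells for fallback processing.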

#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as a fallback
- **AND** reconstruct table structure from detected lines
- **AND** include a fallback_used flag in table metadata

#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization
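
Clamping can be illustrated with a short sketch. It assumes boxes and the table bbox are `(x0, y0, x1, y1)` tuples; `clamp_box` and `boxes_overlap` are hypothetical helper names. Clamping each coordinate independently is a monotonic operation, so the left-to-right and top-to-bottom order of cells is preserved and cells that did not overlap before clamping do not overlap afterwards.

```python
def clamp_box(box, table_bbox):
    """Clamp every coordinate of box into the table bounding box."""
    tx0, ty0, tx1, ty1 = table_bbox
    x0, y0, x1, y1 = box
    return (
        min(max(x0, tx0), tx1),
        min(max(y0, ty0), ty1),
        min(max(x1, tx0), tx1),
        min(max(y1, ty0), ty1),
    )


def boxes_overlap(a, b):
    """True if two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


table = (100, 50, 200, 100)
cells = [(95, 48, 150, 100), (150, 48, 205, 102)]
clamped = [clamp_box(c, table) for c in cells]
```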

### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.

#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width x height)
- **AND** filter out images with area < 200 square pixels
- **AND** log the filtered image count for debugging

#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering
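
The area filter with a configurable threshold can be sketched as follows. The sketch assumes each image is a dict with `width` and `height` in pixels; the function name `filter_decoration_images` and the parameter name `min_area` are hypothetical. Because the comparison is `area < min_area`, a threshold of 0 never removes anything, which gives the "disable filtering" behaviour without a special case.

```python
import logging

logger = logging.getLogger("image_filter")  # hypothetical logger name


def filter_decoration_images(images, min_area=200):
    """Drop images whose area (width x height) is below min_area.

    Returns the kept images and the number of images removed.
    """
    kept = [img for img in images if img["width"] * img["height"] >= min_area]
    removed = len(images) - len(kept)
    logger.debug("Filtered %d decoration image(s)", removed)
    return kept, removed


images = [
    {"width": 10, "height": 10},   # area 100, below the default threshold
    {"width": 20, "height": 10},   # area 200, exactly at the threshold
    {"width": 100, "height": 50},  # area 5000
]
kept, removed = filter_decoration_images(images)
kept_all, removed_none = filter_decoration_images(images, min_area=0)
```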

### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.

#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion
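
The IoU test can be sketched as below. This is an illustration under the assumption that both the image and the underlying content are described by axis-aligned `(x0, y0, x1, y1)` boxes; `iou` and `is_covering` are hypothetical helper names, and the fill-colour check described above is not shown.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covering(image_bbox, content_bbox, threshold=0.8):
    """Treat an image as covering when its IoU with the content exceeds threshold."""
    return iou(image_bbox, content_bbox) > threshold


# A rectangle drawn almost exactly over a line of text.
covering = is_covering((0, 0, 100, 20), (2, 1, 98, 19))
# An image that only partly overlaps the text.
partial = is_covering((0, 0, 10, 10), (5, 0, 15, 10))
```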

#### Scenario: Exclude covering images from rendering
- **WHEN** generating the output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include a covering_images_removed count in metadata

#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect black fill (redaction style)
- **AND** white fill (whiteout style)
- **AND** low-contrast rectangles intended to hide content

## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through the OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
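
Keeping the table HTML while also deriving plain text for translation could be done with the standard library alone. The sketch below assumes the table arrives as an HTML string with `tr`, `td`, and `th` tags; the class `TableTextExtractor` and the function `table_html_to_text` are hypothetical names, and the tab/newline output layout is just one possible choice.

```python
from html.parser import HTMLParser


class TableTextExtractor(HTMLParser):
    """Collect the text of each td/th cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a tr
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_html_to_text(html):
    """Return tab-separated cell text, one table row per line."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(c.strip() for c in row) for row in parser.rows)


html = '<table><tr><td colspan="2">Total</td></tr><tr><td>A</td><td>B</td></tr></table>'
text = table_html_to_text(html)
```

The original HTML string is left untouched, so merge attributes such as `colspan` remain available for structure while `text` goes to translation.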

#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in the output PDF

### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges

#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than the minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images

#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output