chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,91 @@
## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
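
The scenario above can be sketched as a single pass over `parsing_res_list`. This is a minimal illustration that works on a plain dict, not a call into PP-StructureV3 itself. The `label` and `content` keys are assumptions made for the example; only `parsing_res_list` and `layout_bbox` come from this spec.

```python
from typing import Any


def extract_elements(page_json: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect every entry of parsing_res_list, keeping its type and bbox."""
    elements = []
    for item in page_json.get("parsing_res_list", []):
        elements.append({
            # "label" and "content" are hypothetical key names for this sketch.
            "type": item.get("label", "unknown"),
            "content": item.get("content", ""),
            # layout_bbox is copied unchanged so coordinates are preserved.
            "bbox": list(item["layout_bbox"]),
        })
    return elements


sample_page = {
    "parsing_res_list": [
        {"label": "header", "layout_bbox": [10, 10, 500, 40], "content": "Title"},
        {"label": "table", "layout_bbox": [10, 60, 500, 300], "content": ""},
    ]
}
elements = extract_elements(sample_page)
```

No element type is filtered out here, which is the point of the requirement: headers, lists, tables and figures all pass through with their coordinates.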

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
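
One possible shape for the ordering step is sketched below. The function name, the `index` key and the optional `reorder_key` hook are hypothetical; the spec only requires that the list order is kept by default and that reordering remains possible.

```python
from typing import Any, Callable, Optional


def assign_reading_order(
    elements: list[dict[str, Any]],
    reorder_key: Optional[Callable[[dict[str, Any]], Any]] = None,
) -> list[dict[str, Any]]:
    """Return copies of the elements with a sequential 'index' field.

    Without reorder_key the incoming order (the parsing_res_list order) is
    kept. With reorder_key the elements are sorted first; sorted() is stable,
    so ties keep their original relative order.
    """
    ordered = sorted(elements, key=reorder_key) if reorder_key else list(elements)
    return [{**element, "index": i} for i, element in enumerate(ordered)]


blocks = [
    {"type": "text", "bbox": [10, 200, 300, 240]},
    {"type": "header", "bbox": [10, 10, 300, 40]},
]
as_detected = assign_reading_order(blocks)
top_to_bottom = assign_reading_order(blocks, reorder_key=lambda e: e["bbox"][1])
```

Sorting by the top edge is only one example of a reordering rule; multi-column layouts would need a different key.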

#### Scenario: Extract table structure with HTML content
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries from table_res_list
- **AND** extract pred_html for table HTML content
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
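
Two of the steps above lend themselves to a short sketch: checking `cell_boxes` against the page boundaries and deriving plain text from `pred_html`. The helper names are hypothetical, and boxes are assumed to be `[x0, y0, x1, y1]` in pixels; the fallback detection itself is not shown.

```python
from html.parser import HTMLParser


class _CellTextParser(HTMLParser):
    """Collects the text of each <td>/<th> cell in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: list[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def html_to_plain_text(pred_html: str) -> str:
    """Join the non-empty cell texts with single spaces."""
    parser = _CellTextParser()
    parser.feed(pred_html)
    parser.close()
    return " ".join(c.strip() for c in parser.cells if c.strip())


def boxes_within_page(cell_boxes, page_width: float, page_height: float) -> bool:
    """True only if every box is non-degenerate and lies inside the page."""
    return all(
        0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
        for x0, y0, x1, y1 in cell_boxes
    )


text = html_to_plain_text("<table><tr><td>A</td><td> B </td></tr></table>")
```

A caller would switch to the fallback path whenever `boxes_within_page` returns `False`, while still keeping the HTML for structure.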

#### Scenario: Table matching via bbox overlap
- **GIVEN** a table element from parsing_res_list without direct HTML content
- **WHEN** matching against table_res_list using bbox overlap
- **AND** overlap ratio exceeds 10%
- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
- **AND** set element['html'] to the extracted pred_html
- **AND** set element['extracted_text'] from the HTML content
- **AND** log the successful extraction
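
The matching rule can be sketched as follows. The spec does not say which area the overlap is measured against, so this example assumes intersection area divided by the element's own bbox area. The `bbox` key on each `table_res`, the `cell_boxes` key on the element, and the tag-stripping regex are also assumptions made for illustration.

```python
import logging
import re

logger = logging.getLogger(__name__)


def overlap_ratio(a, b) -> float:
    """Intersection area divided by the area of box a (an assumed definition)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def match_table(element: dict, table_res_list: list, threshold: float = 0.10) -> bool:
    """Attach HTML and cell boxes from the best-overlapping table_res, if any."""
    best, best_ratio = None, threshold
    for table_res in table_res_list:
        ratio = overlap_ratio(element["bbox"], table_res["bbox"])
        if ratio > best_ratio:  # must strictly exceed the threshold
            best, best_ratio = table_res, ratio
    if best is None:
        return False
    element["html"] = best["pred_html"]
    element["cell_boxes"] = best["cell_box_list"]
    # Crude tag stripping; enough for a sketch, not a full HTML parser.
    element["extracted_text"] = " ".join(re.sub(r"<[^>]+>", " ", best["pred_html"]).split())
    logger.info("Matched table via bbox overlap (ratio=%.2f)", best_ratio)
    return True


table_element = {"bbox": [0, 0, 100, 100]}
candidates = [{
    "bbox": [50, 50, 150, 150],
    "pred_html": "<table><tr><td>x</td></tr></table>",
    "cell_box_list": [[50, 50, 100, 100]],
}]
matched = match_table(table_element, candidates)
```

In the sample the intersection is 50 × 50 px² against an element area of 10,000 px², a ratio of 0.25, so the table matches.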

#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF

## ADDED Requirements

### Requirement: OCR Track PDF Coordinate System

The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.

#### Scenario: PDF page size matches OCR coordinate system
- **GIVEN** an OCR track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the OCR image dimensions as PDF page size
- **AND** set scale factors to 1.0 (no scaling)
- **AND** preserve original bbox coordinates without transformation
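
As a rough illustration of this scenario, the page geometry can be derived directly from the OCR image size. The function and key names are hypothetical; the sketch only shows that with scale factors of 1.0 a bbox comes out unchanged.

```python
def ocr_track_geometry(image_width: int, image_height: int) -> dict:
    """Page geometry for the OCR track: page size equals the OCR image size."""
    return {
        "page_width": float(image_width),
        "page_height": float(image_height),
        "scale_x": 1.0,  # no scaling on the OCR track
        "scale_y": 1.0,
    }


def place_bbox(bbox, geometry: dict) -> tuple:
    """Apply the scale factors; with 1.0 the coordinates are unchanged."""
    x0, y0, x1, y1 = bbox
    sx, sy = geometry["scale_x"], geometry["scale_y"]
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)


geometry = ocr_track_geometry(2480, 3508)
placed = place_bbox((100, 200, 400, 260), geometry)
```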

#### Scenario: Text font size calculation without scaling
- **GIVEN** a text element with bbox height H in OCR coordinates
- **WHEN** rendering text in PDF
- **THEN** the system SHALL calculate font size based directly on bbox height
- **AND** NOT apply additional scaling factors
- **AND** ensure readable text output
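
A minimal sketch of the font size rule, assuming the size is a fixed fraction of the bbox height H with a lower bound for readability. Both `fill_ratio` and `min_size` are illustrative values chosen for this example and are not specified by this requirement.

```python
def font_size_from_bbox(bbox, fill_ratio: float = 0.75, min_size: float = 4.0) -> float:
    """Derive the font size from bbox height alone, with no page scaling.

    fill_ratio and min_size are hypothetical tuning values for this sketch.
    """
    height = bbox[3] - bbox[1]
    return max(height * fill_ratio, min_size)


normal_line = font_size_from_bbox((0, 0, 100, 40))
tiny_line = font_size_from_bbox((0, 0, 100, 2))
```

The lower bound stands in for the "readable text output" condition: very thin boxes do not collapse to an illegible size.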

#### Scenario: Direct Track PDF maintains original size
- **GIVEN** a direct track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the original PDF page dimensions
- **AND** preserve existing coordinate transformation logic
- **AND** NOT be affected by OCR Track coordinate changes

### Requirement: Table Cell Quality Assessment

The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.

#### Scenario: Cell density threshold
- **GIVEN** a table with cell_boxes from PP-StructureV3
- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific density value for debugging

#### Scenario: Average cell area threshold
- **GIVEN** a table with cell_boxes
- **WHEN** average cell area is less than 2,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific area value for debugging

#### Scenario: Valid tables with normal metrics
- **GIVEN** a table with cell density of at most 5.0 cells per 10,000 px² and average cell area of at least 2,000 px²
- **WHEN** quality assessment is applied
- **THEN** the table SHALL be considered valid
- **AND** cell_boxes SHALL be used for rendering
- **AND** table content SHALL be displayed in PDF output
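
The two thresholds in this requirement can be combined into one check, sketched below. Density is assumed to be measured against the table's own bbox area, and the function name, result keys and `ValueError` guard are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def assess_cell_quality(cell_boxes, table_bbox,
                        max_density: float = 5.0,
                        min_avg_area: float = 2000.0) -> dict:
    """Flag tables whose cell metrics suggest over-detection."""
    table_area = (table_bbox[2] - table_bbox[0]) * (table_bbox[3] - table_bbox[1])
    if not cell_boxes or table_area <= 0:
        raise ValueError("need at least one cell and a non-empty table bbox")

    # Cells per 10,000 px² of table area (the area basis is an assumption).
    density = len(cell_boxes) * 10000 / table_area
    avg_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cell_boxes) / len(cell_boxes)

    flags = []
    if density > max_density:
        flags.append("density")
        logger.debug("cell density %.2f exceeds %.2f", density, max_density)
    if avg_area < min_avg_area:
        flags.append("avg_area")
        logger.debug("average cell area %.1f below %.1f", avg_area, min_avg_area)
    return {"density": density, "avg_area": avg_area, "flags": flags, "valid": not flags}


# Four 500x500 cells in a 1000x1000 table: sparse and large, so valid.
good = assess_cell_quality(
    [(0, 0, 500, 500), (500, 0, 1000, 500), (0, 500, 500, 1000), (500, 500, 1000, 1000)],
    (0, 0, 1000, 1000),
)
# Ten 10x10 cells in a 100x100 table: dense and tiny, so flagged twice.
bad = assess_cell_quality([(0, 0, 10, 10)] * 10, (0, 0, 100, 100))
```

A table sitting exactly on a threshold (density 5.0 or average area 2,000 px²) is not flagged in this sketch, since the flagging scenarios use "exceeds" and "less than".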