chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
egg
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions


@@ -0,0 +1,91 @@
## MODIFIED Requirements
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure with HTML content
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries from table_res_list
- **AND** extract pred_html for table HTML content
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
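
The coordinate validation and plain-text extraction steps in this scenario can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names `boxes_within_page` and `table_html_to_text`, the `(x0, y0, x1, y1)` box layout, and the ` | ` cell separator are all assumptions made for the example. Only the Python standard library is used.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Sequence


def boxes_within_page(cell_boxes: Iterable[Sequence[float]],
                      page_width: float, page_height: float) -> bool:
    """Return True if every (x0, y0, x1, y1) box is non-empty and inside the page."""
    return all(
        0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
        for x0, y0, x1, y1 in cell_boxes
    )


class _CellTextParser(HTMLParser):
    """Collects the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # tolerate cells outside a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_html_to_text(pred_html: str) -> str:
    """Flatten table HTML into one line per row, cells joined by ' | ' (assumed format)."""
    parser = _CellTextParser()
    parser.feed(pred_html)
    parser.close()
    return "\n".join(
        " | ".join(cell.strip() for cell in row)
        for row in parser.rows if row
    )
```

A caller would keep `pred_html` unchanged for structure, pass the flattened text to translation, and fall back to another detection path when `boxes_within_page` returns `False`.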
#### Scenario: Table matching via bbox overlap
- **GIVEN** a table element from parsing_res_list without direct HTML content
- **WHEN** matching against table_res_list using bbox overlap
- **AND** overlap ratio exceeds 10%
- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
- **AND** set element['html'] to the extracted pred_html
- **AND** set element['extracted_text'] from the HTML content
- **AND** log the successful extraction
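
A minimal sketch of the bbox-overlap matching described above. The spec does not define how the overlap ratio is computed, so this example assumes it is the intersection area divided by the area of the `parsing_res_list` element's box. The `bbox` key on both dictionaries, the `cell_boxes` key on the element, and the function names are likewise hypothetical.

```python
import logging
from typing import List, Optional, Sequence

logger = logging.getLogger(__name__)


def overlap_ratio(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection area of a and b divided by the area of a (assumed definition)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    return (inter_w * inter_h) / area_a if area_a > 0 else 0.0


def match_table(element: dict, table_res_list: List[dict],
                threshold: float = 0.10) -> Optional[dict]:
    """Attach pred_html and cell boxes from the best-overlapping table_res, if any."""
    best, best_ratio = None, threshold
    for table_res in table_res_list:
        ratio = overlap_ratio(element["bbox"], table_res["bbox"])
        if ratio > best_ratio:  # strictly "exceeds" the threshold
            best, best_ratio = table_res, ratio
    if best is None:
        return None
    element["html"] = best.get("pred_html", "")
    element["cell_boxes"] = best.get("cell_box_list", [])
    logger.info("Matched table via bbox overlap (ratio=%.2f)", best_ratio)
    return best
```

Choosing the candidate with the highest ratio, rather than the first one above 10%, avoids attaching the wrong HTML when two detected tables sit close together. Filling `element['extracted_text']` from the HTML is left to a separate step.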
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
## ADDED Requirements
### Requirement: OCR Track PDF Coordinate System
The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.
#### Scenario: PDF page size matches OCR coordinate system
- **GIVEN** an OCR track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the OCR image dimensions as PDF page size
- **AND** set scale factors to 1.0 (no scaling)
- **AND** preserve original bbox coordinates without transformation
#### Scenario: Text font size calculation without scaling
- **GIVEN** a text element with bbox height H in OCR coordinates
- **WHEN** rendering text in PDF
- **THEN** the system SHALL calculate font size based directly on bbox height
- **AND** NOT apply additional scaling factors
- **AND** ensure readable text output
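
The two OCR Track scenarios above amount to very little arithmetic, which the following sketch makes explicit. The function names, the returned dictionary keys, and the `min_size` floor (one possible reading of "readable text output") are assumptions for illustration and do not come from the spec.

```python
from typing import Dict, Sequence


def ocr_page_layout(image_width: float, image_height: float) -> Dict[str, object]:
    """Use the OCR image dimensions as the PDF page size, with no scaling."""
    return {
        "page_size": (float(image_width), float(image_height)),
        "scale_x": 1.0,
        "scale_y": 1.0,
    }


def font_size_from_bbox(bbox: Sequence[float], min_size: float = 1.0) -> float:
    """Font size taken directly from the bbox height in OCR coordinates.

    No scale factor is applied; min_size is a hypothetical floor that only
    guards against degenerate (zero-height) boxes.
    """
    height = float(bbox[3] - bbox[1])
    return max(height, min_size)
```

Because the page is the same size as the OCR image, a bbox height of 24 units maps to a 24-unit font size without any intermediate conversion.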
#### Scenario: Direct Track PDF maintains original size
- **GIVEN** a direct track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the original PDF page dimensions
- **AND** preserve existing coordinate transformation logic
- **AND** NOT be affected by OCR Track coordinate changes
### Requirement: Table Cell Quality Assessment
The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.
#### Scenario: Cell density threshold
- **GIVEN** a table with cell_boxes from PP-StructureV3
- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific density value for debugging
#### Scenario: Average cell area threshold
- **GIVEN** a table with cell_boxes
- **WHEN** average cell area is less than 2,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific area value for debugging
#### Scenario: Valid tables with normal metrics
- **GIVEN** a table with density ≤ 5.0 cells per 10,000 px² and average cell area ≥ 2,000 px²
- **WHEN** quality assessment is applied
- **THEN** the table SHALL be considered valid
- **AND** cell_boxes SHALL be used for rendering
- **AND** table content SHALL be displayed in PDF output
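
The three quality scenarios can be combined into one check, sketched below. Two points are assumptions: density is computed against the table's own bounding-box area (the spec does not say which area is used), and the function name, parameter names, and result keys are hypothetical.

```python
import logging
from typing import Dict, List, Sequence

logger = logging.getLogger(__name__)


def assess_cell_boxes(cell_boxes: List[Sequence[float]],
                      table_bbox: Sequence[float],
                      max_density: float = 5.0,
                      min_avg_area: float = 2000.0) -> Dict[str, object]:
    """Flag a table as potentially over-detected based on density and cell area."""
    x0, y0, x1, y1 = table_bbox
    table_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if not cell_boxes or table_area <= 0:
        return {"valid": False, "density": 0.0, "avg_area": 0.0,
                "reasons": ["no usable cell_boxes"]}

    areas = [max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1]) for b in cell_boxes]
    density = len(cell_boxes) / table_area * 10_000  # cells per 10,000 px²
    avg_area = sum(areas) / len(areas)               # px²

    reasons = []
    if density > max_density:
        reasons.append("density")
        logger.debug("Cell density %.2f exceeds %.2f per 10,000 px²",
                     density, max_density)
    if avg_area < min_avg_area:
        reasons.append("avg_area")
        logger.debug("Average cell area %.1f px² is below %.1f px²",
                     avg_area, min_avg_area)

    return {"valid": not reasons, "density": density,
            "avg_area": avg_area, "reasons": reasons}
```

When `valid` is `True`, the caller would render the table from `cell_boxes`. Otherwise the logged values show which threshold was crossed.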