## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3

The system SHALL utilize the full capabilities of PP-StructureV3, extracting all element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure

- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order

- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure with HTML content

- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries from table_res_list
- **AND** extract pred_html for table HTML content
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation

#### Scenario: Table matching via bbox overlap

- **GIVEN** a table element from parsing_res_list without direct HTML content
- **WHEN** matching against table_res_list using bbox overlap
- **AND** overlap ratio exceeds 10%
- **THEN** the system SHALL extract both cell_box_list and pred_html from the matched table_res
- **AND** set element['html'] to the extracted pred_html
- **AND** set element['extracted_text'] from the HTML content
- **AND** log the successful extraction

#### Scenario: Extract visual elements with paths

- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF

## ADDED Requirements

### Requirement: OCR Track PDF Coordinate System

The system SHALL generate PDF output for OCR Track using the OCR coordinate system dimensions to ensure accurate text sizing and positioning.

#### Scenario: PDF page size matches OCR coordinate system

- **GIVEN** an OCR track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the OCR image dimensions as PDF page size
- **AND** set scale factors to 1.0 (no scaling)
- **AND** preserve original bbox coordinates without transformation

#### Scenario: Text font size calculation without scaling

- **GIVEN** a text element with bbox height H in OCR coordinates
- **WHEN** rendering text in PDF
- **THEN** the system SHALL calculate font size based directly on bbox height
- **AND** NOT apply additional scaling factors
- **AND** ensure readable text output

#### Scenario: Direct Track PDF maintains original size

- **GIVEN** a direct track processing task
- **WHEN** generating the output PDF
- **THEN** the system SHALL use the original PDF page dimensions
- **AND** preserve existing coordinate transformation logic
- **AND** NOT be affected by OCR Track coordinate changes

### Requirement: Table Cell Quality Assessment

The system SHALL assess table cell_boxes quality with appropriate thresholds to avoid filtering valid tables.

#### Scenario: Cell density threshold

- **GIVEN** a table with cell_boxes from PP-StructureV3
- **WHEN** cell density exceeds 5.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific density value for debugging

#### Scenario: Average cell area threshold

- **GIVEN** a table with cell_boxes
- **WHEN** average cell area is less than 2,000 px²
- **THEN** the system SHALL flag the table as potentially over-detected
- **AND** log the specific area value for debugging

#### Scenario: Valid tables with normal metrics

- **GIVEN** a table with density < 5.0 cells per 10,000 px² and average cell area > 2,000 px²
- **WHEN** quality assessment is applied
- **THEN** the table SHALL be considered valid
- **AND** cell_boxes SHALL be used for rendering
- **AND** table content SHALL be displayed in PDF output
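
The Table Cell Quality Assessment scenarios can be illustrated with a minimal Python sketch. It is not the system's implementation. The names `assess_cell_boxes`, `CellQuality`, and `box_area` are hypothetical, boxes are assumed to be axis-aligned `(x0, y0, x1, y1)` tuples in pixels, and density is assumed to be measured against the table's own bbox area. Values exactly at a threshold are treated as valid here, which the scenarios leave unspecified.

```python
import logging
from dataclasses import dataclass
from typing import Sequence

logger = logging.getLogger(__name__)

Box = Sequence[float]  # assumed layout: (x0, y0, x1, y1) in pixels

MAX_CELL_DENSITY = 5.0      # cells per 10,000 px² (threshold from the spec)
MIN_AVG_CELL_AREA = 2000.0  # px² (threshold from the spec)


@dataclass
class CellQuality:
    density: float        # cells per 10,000 px² of table area
    avg_cell_area: float  # mean cell area in px²
    over_detected: bool   # True if either threshold is violated


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def assess_cell_boxes(table_bbox: Box, cell_boxes: Sequence[Box]) -> CellQuality:
    """Flag a table as potentially over-detected using the two spec thresholds."""
    table_area = box_area(table_bbox)
    if table_area == 0 or not cell_boxes:
        # The spec does not cover degenerate input; rejecting it is an assumption.
        raise ValueError("table_bbox must have positive area and cell_boxes must be non-empty")

    density = len(cell_boxes) / table_area * 10_000
    avg_area = sum(box_area(b) for b in cell_boxes) / len(cell_boxes)

    too_dense = density > MAX_CELL_DENSITY
    too_small = avg_area < MIN_AVG_CELL_AREA
    if too_dense:
        logger.warning("Cell density %.2f per 10,000 px² exceeds %.1f", density, MAX_CELL_DENSITY)
    if too_small:
        logger.warning("Average cell area %.0f px² is below %.0f", avg_area, MIN_AVG_CELL_AREA)

    return CellQuality(density, avg_area, too_dense or too_small)
```

A caller would use `cell_boxes` for rendering only when `over_detected` is false, and fall back to another rendering path otherwise.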