## ADDED Requirements

### Requirement: OCR Track Table Data Structure Consistency

The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.

#### Scenario: OCR Track produces structured TableData for HTML tables
- **GIVEN** a document with tables is processed via OCR Track
- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field
- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
- **AND** the `TableData.cells` array SHALL contain one `TableCell` object for each cell
- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values
- **AND** the output format SHALL match Direct Track's `TableData` structure

#### Scenario: OCR Track handles tables with merged cells
- **GIVEN** an HTML table with `rowspan` or `colspan` attributes
- **WHEN** the table is converted to `TableData`
- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values
- **AND** the cell content SHALL be correctly extracted

#### Scenario: OCR Track handles header rows
- **GIVEN** an HTML table with `<th>` elements or a header row
- **WHEN** the table is converted to `TableData`
- **THEN** the `TableData.headers` field SHALL contain the header cell contents
- **AND** header cells SHALL also be included in the `cells` array

#### Scenario: OCR Track gracefully handles malformed HTML tables
- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting)
- **WHEN** parsing is attempted
- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser
- **AND** if parsing fails completely, the system SHALL fall back to returning a basic `TableData` with row/col counts
- **AND** the system SHALL log a warning for debugging purposes

#### Scenario: PDF Generator renders OCR Track tables correctly
- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements
- **WHEN** the PDF Generator processes the document
- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text)
- **AND** the rendering SHALL be identical to Direct Track table rendering

#### Scenario: Direct Track table processing remains unchanged
- **GIVEN** a native PDF with embedded tables
- **WHEN** the document is processed via Direct Track
- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior

#### Scenario: Hybrid Mode table source isolation
- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
- **WHEN** the system merges OCR Track results into Direct Track results
- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
- **AND** table elements SHALL come exclusively from Direct Track
- **AND** no OCR Track table data SHALL contaminate the final output
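
The OCR Track conversion scenarios above (cell positions, merged cells, header rows, and the malformed-HTML fallback) could be sketched as follows. This is an illustrative sketch only, not the project's implementation: the `TableCell` and `TableData` dataclasses here are simplified stand-ins whose `rows` and `cols` field names are assumptions, and `html_table_to_table_data`, `_TableHTMLParser`, and `_span` are hypothetical names. It uses Python's standard-library `html.parser`, which tolerates missing closing tags, and it does not handle nested tables.

```python
import logging
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class TableCell:  # simplified stand-in for the real model (assumption)
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # `rows`/`cols` field names are assumptions
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None


def _span(value: Optional[str]) -> int:
    """Parse a rowspan/colspan attribute, defaulting to 1 when absent or invalid."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class _TableHTMLParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.cells: List[TableCell] = []
        self.headers: List[str] = []
        self.occupied = set()  # (row, col) grid slots already claimed by a cell
        self._row = -1
        self._current: Optional[TableCell] = None
        self._is_header = False
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_cell()  # tolerate a missing </td> before a new row
            self._row += 1
        elif tag in ("td", "th"):
            self.close_cell()  # tolerate a missing </td> before a new cell
            self._row = max(self._row, 0)  # cell outside any <tr>
            attr = dict(attrs)
            row_span, col_span = _span(attr.get("rowspan")), _span(attr.get("colspan"))
            col = 0
            while (self._row, col) in self.occupied:  # skip slots taken by spans
                col += 1
            for r in range(self._row, self._row + row_span):
                for c in range(col, col + col_span):
                    self.occupied.add((r, c))
            self._current = TableCell(self._row, col, "", row_span, col_span)
            self._is_header = tag == "th"
            self._text = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.close_cell()

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def close_cell(self) -> None:
        """Finalize the cell being read, if any, normalizing whitespace."""
        if self._current is None:
            return
        self._current.content = " ".join("".join(self._text).split())
        self.cells.append(self._current)
        if self._is_header:
            self.headers.append(self._current.content)
        self._current = None


def html_table_to_table_data(html: str) -> TableData:
    parser = _TableHTMLParser()
    parser.feed(html)
    parser.close()
    parser.close_cell()  # flush a cell left open at end of input
    if not parser.cells:
        # Fallback: no cells recovered, so return counts only and log a warning.
        logger.warning("Could not parse table HTML; returning basic TableData")
        return TableData(rows=html.lower().count("<tr"), cols=0)
    rows = max(r for r, _ in parser.occupied) + 1
    cols = max(c for _, c in parser.occupied) + 1
    return TableData(rows, cols, parser.cells, parser.headers or None)
```

Tracking occupied grid slots is what lets a cell that follows a `rowspan` cell from an earlier row land in the correct column. Header cells are recorded in `headers` and also kept in `cells`, in line with the header-row scenario.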