fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00
parent a227311b2d
commit 6e050eb540
8 changed files with 585 additions and 30 deletions
--- a/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md
+++ b/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/specs/ocr-processing/spec.md
@@ -0,0 +1,51 @@
+## ADDED Requirements
+
+### Requirement: OCR Track Table Data Structure Consistency
+The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.
+
+#### Scenario: OCR Track produces structured TableData for HTML tables
+- **GIVEN** a document with tables is processed via OCR Track
+- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field
+- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
+- **AND** the `TableData.cells` array SHALL contain `TableCell` objects for each cell
+- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values
+- **AND** the output format SHALL match Direct Track's `TableData` structure
+
+#### Scenario: OCR Track handles tables with merged cells
+- **GIVEN** an HTML table with `rowspan` or `colspan` attributes
+- **WHEN** the table is converted to `TableData`
+- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values
+- **AND** the cell content SHALL be correctly extracted
+
+#### Scenario: OCR Track handles header rows
+- **GIVEN** an HTML table with `<th>` elements or a header row
+- **WHEN** the table is converted to `TableData`
+- **THEN** the `TableData.headers` field SHALL contain the header cell contents
+- **AND** header cells SHALL also be included in the `cells` array
+
+#### Scenario: OCR Track gracefully handles malformed HTML tables
+- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting)
+- **WHEN** parsing is attempted
+- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser
+- **AND** if parsing fails completely, the system SHALL fall back to returning a basic `TableData` with row/col counts only
+- **AND** the system SHALL log a warning for debugging purposes
+
+#### Scenario: PDF Generator renders OCR Track tables correctly
+- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements
+- **WHEN** the PDF Generator processes the document
+- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text)
+- **AND** the rendering SHALL be identical to Direct Track table rendering
+
+#### Scenario: Direct Track table processing remains unchanged
+- **GIVEN** a native PDF with embedded tables
+- **WHEN** the document is processed via Direct Track
+- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
+- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
+- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior
+
+#### Scenario: Hybrid Mode table source isolation
+- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
+- **WHEN** the system merges OCR Track results into Direct Track results
+- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
+- **AND** table elements SHALL exclusively come from Direct Track
+- **AND** no OCR Track table data SHALL contaminate the final output
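
The table scenarios in the spec above (cell positions, merged cells, header rows, tolerant parsing with a fallback) can be illustrated with a minimal sketch. This is not the code from `ocr_to_unified_converter.py`: the commit uses BeautifulSoup, while this sketch swaps in the standard library's `html.parser` so it runs without extra dependencies. The `TableCell` and `TableData` dataclasses, `parse_html_table`, and the rule for placing cells around earlier spans are all assumptions made for illustration; only the field names `row`, `col`, `content`, `row_span`, `col_span`, `cells`, and `headers` come from the spec.

```python
import logging
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple

logger = logging.getLogger(__name__)

@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1

@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: List[str] = field(default_factory=list)

def _span(value: Optional[str]) -> int:
    """Read a rowspan/colspan attribute, treating bad values as 1."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1

class _TableParser(HTMLParser):
    """Collects cells; a new <tr>/<td>/<th> implicitly closes an open cell."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self.headers: List[str] = []
        self.occupied: Set[Tuple[int, int]] = set()  # grid slots already taken
        self._row = -1
        self._current: Optional[TableCell] = None
        self._is_header = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_cell()
            self._row += 1
        elif tag in ("td", "th"):
            self.close_cell()
            self._row = max(self._row, 0)  # tolerate a cell with no <tr>
            attr = dict(attrs)
            rs, cs = _span(attr.get("rowspan")), _span(attr.get("colspan"))
            col = 0
            while (self._row, col) in self.occupied:  # skip earlier spans
                col += 1
            for r in range(self._row, self._row + rs):
                for c in range(col, col + cs):
                    self.occupied.add((r, c))
            self._current = TableCell(self._row, col, "", rs, cs)
            self._is_header = tag == "th"

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.close_cell()

    def handle_data(self, data):
        if self._current is not None:
            self._current.content += data

    def close_cell(self) -> None:
        if self._current is None:
            return
        self._current.content = " ".join(self._current.content.split())
        self.cells.append(self._current)
        if self._is_header:
            self.headers.append(self._current.content)
        self._current = None

def parse_html_table(html: str) -> TableData:
    """Best-effort conversion of an HTML table string to TableData."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()       # flush any text still buffered by HTMLParser
    parser.close_cell()  # flush a cell left open by missing closing tags
    if not parser.cells:
        # Fallback: no cells recovered, so return counts only and warn.
        logger.warning("No table cells parsed; returning basic TableData")
        return TableData(rows=html.lower().count("<tr"), cols=0)
    rows = max(r for r, _ in parser.occupied) + 1
    cols = max(c for _, c in parser.occupied) + 1
    return TableData(rows, cols, parser.cells, parser.headers)
```

For example, `parse_html_table("<table><tr><td>a<td>b<tr><td>c")` still recovers three cells despite the missing closing tags, because each new `<td>` or `<tr>` closes the cell before it. Header cells are recorded in `headers` and also kept in `cells`, mirroring the header-row scenario.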