Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
ADDED Requirements
Requirement: OCR Track Table Data Structure Consistency
The OCR Track SHALL produce TableData objects with fully populated cells arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.
Scenario: OCR Track produces structured TableData for HTML tables
- GIVEN a document with tables is processed via OCR Track
- WHEN PP-StructureV3 returns HTML table content in the `html` or `content` field
- THEN the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
- AND the `TableData.cells` array SHALL contain `TableCell` objects for each cell
- AND each `TableCell` SHALL have correct `row`, `col`, and `content` values
- AND the output format SHALL match Direct Track's `TableData` structure
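A minimal sketch of the HTML-to-`TableData` step this scenario describes. The commit message says the converter uses BeautifulSoup; to stay dependency-free, this sketch swaps in the standard library's `html.parser` instead. The `TableCell` and `TableData` dataclasses and the `html_to_table_data` function are hypothetical stand-ins, not the project's actual definitions, and merged cells are ignored here.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


class _CellCollector(HTMLParser):
    """Collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._buf: Optional[List[str]] = None  # set while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # cell without an enclosing <tr>
                self.rows.append([])
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            self.rows[-1].append("".join(self._buf).strip())
            self._buf = None


def html_to_table_data(html: str) -> TableData:
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    # One TableCell per parsed cell, indexed by its position in the row.
    cells = [
        TableCell(row=r, col=c, content=text)
        for r, row in enumerate(parser.rows)
        for c, text in enumerate(row)
    ]
    cols = max((len(row) for row in parser.rows), default=0)
    return TableData(rows=len(parser.rows), cols=cols, cells=cells)


table = html_to_table_data(
    "<table><tr><td>Item</td><td>Qty</td></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>"
)
```

The point of the fully populated `cells` list is that a renderer never has to look at the raw HTML again; it only iterates over `table.cells`.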
Scenario: OCR Track handles tables with merged cells
- GIVEN an HTML table with `rowspan` or `colspan` attributes
- WHEN the table is converted to `TableData`
- THEN each `TableCell` SHALL have correct `row_span` and `col_span` values
- AND the cell content SHALL be correctly extracted
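Merged cells shift the column index of every cell that follows them, so a converter has to track which grid slots are already covered. The sketch below shows one common way to do that with an occupancy set. It assumes the cells have already been pulled out of the HTML as `(text, rowspan, colspan)` tuples; `place_cells` and the `TableCell` fields shown are hypothetical, not the project's code.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


def place_cells(rows: List[List[Tuple[str, int, int]]]) -> List[TableCell]:
    """Assign grid coordinates to cells given as (text, rowspan, colspan)."""
    occupied: Set[Tuple[int, int]] = set()
    cells: List[TableCell] = []
    for r, row in enumerate(rows):
        c = 0
        for text, row_span, col_span in row:
            # Skip slots already covered by a span from an earlier cell.
            while (r, c) in occupied:
                c += 1
            cells.append(TableCell(r, c, text, row_span, col_span))
            # Mark every slot this cell covers.
            for dr in range(row_span):
                for dc in range(col_span):
                    occupied.add((r + dr, c + dc))
            c += col_span
    return cells


rows = [
    [("Region", 2, 1), ("Sales", 1, 2)],  # "Region" spans two rows
    [("Q1", 1, 1), ("Q2", 1, 1)],         # starts at column 1, not 0
]
cells = place_cells(rows)
```

Here "Q1" lands at column 1 because column 0 of the second row is still covered by the "Region" cell above it.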
Scenario: OCR Track handles header rows
- GIVEN an HTML table with `<th>` elements or a header row
- WHEN the table is converted to `TableData`
- THEN the `TableData.headers` field SHALL contain the header cell contents
- AND header cells SHALL also be included in the `cells` array
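The header rule can be illustrated with a small parser that records `<th>` text twice: once in a headers list and once as an ordinary cell. This is a sketch using the standard library's `html.parser` rather than the BeautifulSoup parser the commit mentions. `HeaderAwareParser` is a hypothetical name, it only recognizes `<th>` elements (not a header row made of `<td>` cells), and it assumes every cell sits inside a `<tr>`.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class HeaderAwareParser(HTMLParser):
    """Collects (row, col, text) cells and, separately, <th> texts."""

    def __init__(self) -> None:
        super().__init__()
        self.headers: List[str] = []
        self.cells: List[Tuple[int, int, str]] = []
        self._row = -1
        self._col = 0
        self._tag: Optional[str] = None  # "td" or "th" while inside a cell
        self._buf: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._tag == tag:
            text = "".join(self._buf).strip()
            if tag == "th":
                self.headers.append(text)
            # Header cells are kept in the cell list as well.
            self.cells.append((self._row, self._col, text))
            self._col += 1
            self._tag = None


parser = HeaderAwareParser()
parser.feed(
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Ann</td><td>31</td></tr></table>"
)
parser.close()
```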
Scenario: OCR Track gracefully handles malformed HTML tables
- GIVEN an HTML table with malformed markup (missing closing tags, invalid nesting)
- WHEN parsing is attempted
- THEN the system SHALL attempt best-effort parsing using a tolerant HTML parser
- AND if parsing fails completely, SHALL fall back to returning basic `TableData` with row/col counts
- AND SHALL log a warning for debugging purposes
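One possible shape for the fallback path is sketched below. The spec does not say how the row/col counts are obtained when parsing fails; counting `<tr>` and `<td>`/`<th>` opening tags with a regular expression is an assumption made for this example, as are the names `parse_with_fallback`, `always_fails`, and the logger name.

```python
import logging
import re
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("ocr_table_fallback")  # hypothetical logger name


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[tuple] = field(default_factory=list)


def parse_with_fallback(
    html: str, parse: Callable[[str], TableData]
) -> TableData:
    """Try the full parser; on failure return counts only and log a warning."""
    try:
        return parse(html)
    except Exception as exc:
        logger.warning("Table HTML parse failed (%s); returning counts only", exc)
        # Assumption: approximate the grid size from opening tags alone.
        row_chunks = re.split(r"<tr\b", html, flags=re.IGNORECASE)[1:]
        cols = max(
            (len(re.findall(r"<t[dh]\b", chunk, flags=re.IGNORECASE))
             for chunk in row_chunks),
            default=0,
        )
        return TableData(rows=len(row_chunks), cols=cols)


def always_fails(html: str) -> TableData:
    # Stand-in for a parser that cannot handle the input at all.
    raise ValueError("unparseable markup")


result = parse_with_fallback("<table><tr><td>a<td>b<tr><td>c", always_fails)
```

The returned object has an empty `cells` list, so downstream code can tell a degraded table from a fully parsed one.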
Scenario: PDF Generator renders OCR Track tables correctly
- GIVEN a `UnifiedDocument` from OCR Track containing table elements
- WHEN the PDF Generator processes the document
- THEN tables SHALL be rendered as formatted tables (not as raw HTML text)
- AND the rendering SHALL be identical to Direct Track table rendering
Scenario: Direct Track table processing remains unchanged
- GIVEN a native PDF with embedded tables
- WHEN the document is processed via Direct Track
- THEN the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
- AND the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
- AND table rendering in PDF output SHALL be identical to pre-fix behavior
Scenario: Hybrid Mode table source isolation
- GIVEN a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
- WHEN the system merges OCR Track results into Direct Track results
- THEN only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
- AND table elements SHALL exclusively come from Direct Track
- AND no OCR Track table data SHALL contaminate the final output
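The isolation rule amounts to a type filter applied at merge time. The sketch below assumes a simplified `Element` record and an `ElementType` enum whose members and string values are invented for illustration; `merge_hybrid` is likewise a hypothetical name, not the project's merge function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ElementType(Enum):  # member values are assumptions for this sketch
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:  # hypothetical, simplified document element
    type: ElementType
    source: str  # "direct" or "ocr"


# Only these types may cross over from OCR Track in Hybrid Mode.
MERGEABLE_FROM_OCR = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}


def merge_hybrid(direct: List[Element], ocr: List[Element]) -> List[Element]:
    """Keep all Direct Track elements; add only image-like OCR elements."""
    return list(direct) + [e for e in ocr if e.type in MERGEABLE_FROM_OCR]


direct = [Element(ElementType.TEXT, "direct"), Element(ElementType.TABLE, "direct")]
ocr = [
    Element(ElementType.TABLE, "ocr"),   # dropped: tables come from Direct Track
    Element(ElementType.FIGURE, "ocr"),  # kept
    Element(ElementType.LOGO, "ocr"),    # kept
]
merged = merge_hybrid(direct, ocr)
```

Using an allow-list rather than excluding `TABLE` means any element type added later stays out of the merge until it is explicitly permitted.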