egg 6e050eb540 fix: OCR track table data format and image cropping
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00

ADDED Requirements

Requirement: OCR Track Table Data Structure Consistency

The OCR Track SHALL produce TableData objects with fully populated cells arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.

Scenario: OCR Track produces structured TableData for HTML tables

  • GIVEN a document with tables is processed via OCR Track
  • WHEN PP-StructureV3 returns HTML table content in the html or content field
  • THEN the ocr_to_unified_converter SHALL parse the HTML and produce a TableData object
  • AND the TableData.cells array SHALL contain TableCell objects for each cell
  • AND each TableCell SHALL have correct row, col, and content values
  • AND the output format SHALL match Direct Track's TableData structure
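
The conversion described in this scenario can be sketched roughly as follows. This is a minimal illustration, not the converter's actual code: the commit uses BeautifulSoup, while this sketch swaps in the standard-library html.parser to stay dependency-free. The TableCell and TableData shapes and the html_to_table_data name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:
    # Assumed shape; the real TableCell likely carries more fields.
    row: int
    col: int
    content: str = ""


@dataclass
class TableData:
    # Assumed shape; only the fields needed for the illustration.
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


class _SimpleTableParser(HTMLParser):
    """Collects the text of each <td>/<th> with its row and column index."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self._row = -1
        self._col = 0
        self._buf: Optional[List[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            self._row = max(self._row, 0)  # tolerate a cell before any <tr>
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            content = "".join(self._buf).strip()
            self.cells.append(TableCell(self._row, self._col, content))
            self._col += 1
            self._buf = None


def html_to_table_data(html: str) -> TableData:
    """Hypothetical helper: parse table HTML into a populated TableData."""
    parser = _SimpleTableParser()
    parser.feed(html)
    parser.close()
    rows = max((c.row for c in parser.cells), default=-1) + 1
    cols = max((c.col for c in parser.cells), default=-1) + 1
    return TableData(rows=rows, cols=cols, cells=parser.cells)
```

Calling html_to_table_data on a two-row, two-column table yields four TableCell entries in reading order. Span handling is deliberately left out here.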

Scenario: OCR Track handles tables with merged cells

  • GIVEN an HTML table with rowspan or colspan attributes
  • WHEN the table is converted to TableData
  • THEN each TableCell SHALL have correct row_span and col_span values
  • AND the cell content SHALL be correctly extracted
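
One way to assign correct positions when rowspan or colspan is present is to track which grid slots earlier cells already cover. The sketch below assumes the cells have already been extracted as (content, rowspan, colspan) tuples per row; the place_cells name and the TableCell shape are illustrative, not taken from the converter.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class TableCell:
    # Assumed shape for illustration.
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


RawCell = Tuple[str, int, int]  # (content, rowspan, colspan)


def place_cells(rows: List[List[RawCell]]) -> List[TableCell]:
    """Assign grid positions, skipping slots covered by earlier spans."""
    occupied: Set[Tuple[int, int]] = set()
    cells: List[TableCell] = []
    for r, raw_row in enumerate(rows):
        c = 0
        for content, rowspan, colspan in raw_row:
            # Move right past slots claimed by a span from a previous cell.
            while (r, c) in occupied:
                c += 1
            cells.append(TableCell(r, c, content, rowspan, colspan))
            # Mark every slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            c += colspan
    return cells
```

For example, a first row of ("A", 2, 1) and ("B", 1, 2) followed by a second row of two plain cells places the second row's cells at columns 1 and 2, because column 0 is still covered by "A".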

Scenario: OCR Track handles header rows

  • GIVEN an HTML table with <th> elements or a header row
  • WHEN the table is converted to TableData
  • THEN the TableData.headers field SHALL contain the header cell contents
  • AND header cells SHALL also be included in the cells array
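
A minimal sketch of the header behaviour, again using the standard-library html.parser rather than BeautifulSoup: every <th> contributes to a headers list and is also kept in the cells list. The function and class names are hypothetical, cells are plain (row, col, content) tuples for brevity, and detecting a header row that uses only <td> elements is not attempted here.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class HeaderAwareTableParser(HTMLParser):
    """Records <th> text as headers while keeping all cells in one list."""

    def __init__(self) -> None:
        super().__init__()
        self.headers: List[str] = []
        self.cells: List[Tuple[int, int, str]] = []  # (row, col, content)
        self._row = -1
        self._col = 0
        self._buf: Optional[List[str]] = None
        self._in_th = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            self._row = max(self._row, 0)
            self._buf = []
            self._in_th = tag == "th"

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._buf is not None:
            text = "".join(self._buf).strip()
            if self._in_th:
                self.headers.append(text)
            # Header cells are stored alongside body cells.
            self.cells.append((self._row, self._col, text))
            self._col += 1
            self._buf = None


def extract_headers_and_cells(html: str):
    """Hypothetical helper returning (headers, cells) for one table."""
    parser = HeaderAwareTableParser()
    parser.feed(html)
    parser.close()
    return parser.headers, parser.cells
```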

Scenario: OCR Track gracefully handles malformed HTML tables

  • GIVEN an HTML table with malformed markup (missing closing tags, invalid nesting)
  • WHEN parsing is attempted
  • THEN the system SHALL attempt best-effort parsing using a tolerant HTML parser
  • AND if parsing fails completely, the system SHALL fall back to returning a basic TableData containing only row and column counts
  • AND the system SHALL log a warning for debugging purposes
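
The fallback path might look something like the sketch below. It is an assumption-laden illustration: parse_with_fallback, the logger name, and the regex-based counting are all invented for the example, and the real converter may derive the counts differently.

```python
import logging
import re
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger("ocr_table_fallback")  # hypothetical logger name


@dataclass
class TableData:
    # Assumed shape for illustration.
    rows: int
    cols: int
    cells: List[tuple] = field(default_factory=list)


def parse_with_fallback(html: str, parse: Callable[[str], TableData]) -> TableData:
    """Try the structured parser; on failure return counts only and warn."""
    try:
        table = parse(html)
        if table.cells:
            return table
        reason = "parser returned no cells"
    except Exception as exc:  # broad on purpose: any parser error triggers fallback
        reason = f"parser raised {exc!r}"

    # Crude estimate: count <tr> openings, and the widest run of <td>/<th>
    # openings between them. \b keeps <thead> and <track> from matching.
    row_chunks = re.split(r"<tr\b", html, flags=re.IGNORECASE)[1:]
    rows = len(row_chunks)
    cols = max(
        (len(re.findall(r"<t[dh]\b", chunk, flags=re.IGNORECASE)) for chunk in row_chunks),
        default=0,
    )
    logger.warning(
        "Table HTML parsing failed (%s); falling back to %dx%d counts", reason, rows, cols
    )
    return TableData(rows=rows, cols=cols)
```

Passing the parser in as a callable keeps the fallback independent of which HTML library does the structured parsing.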

Scenario: PDF Generator renders OCR Track tables correctly

  • GIVEN a UnifiedDocument from OCR Track containing table elements
  • WHEN the PDF Generator processes the document
  • THEN tables SHALL be rendered as formatted tables (not as raw HTML text)
  • AND the rendering SHALL be identical to Direct Track table rendering

Scenario: Direct Track table processing remains unchanged

  • GIVEN a native PDF with embedded tables
  • WHEN the document is processed via Direct Track
  • THEN the DirectExtractionEngine SHALL continue to produce TableData objects as before
  • AND the ocr_to_unified_converter.py changes SHALL NOT affect Direct Track processing
  • AND table rendering in PDF output SHALL be identical to pre-fix behavior

Scenario: Hybrid Mode table source isolation

  • GIVEN a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
  • WHEN the system merges OCR Track results into Direct Track results
  • THEN only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
  • AND table elements SHALL exclusively come from Direct Track
  • AND no OCR Track table data SHALL contaminate the final output
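
The isolation rule in this scenario amounts to a type filter applied at merge time. A minimal sketch follows, assuming an ElementType enum and an Element record whose string values and source field are invented for the example; only the member names TEXT, TABLE, FIGURE, IMAGE, and LOGO come from the text above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class ElementType(Enum):
    # Member names follow the spec; the string values are assumptions.
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:
    type: ElementType
    source: str  # "direct" or "ocr"; hypothetical field for the example


# Only these element types may cross over from OCR Track in Hybrid Mode.
MERGEABLE_FROM_OCR = frozenset(
    {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
)


def merge_hybrid(direct: Iterable[Element], ocr: Iterable[Element]) -> List[Element]:
    """Keep all Direct Track elements; add only image-like OCR elements."""
    merged = list(direct)
    merged.extend(e for e in ocr if e.type in MERGEABLE_FROM_OCR)
    return merged
```

With this filter, an OCR Track TABLE element is dropped during merging, so every table in the result originates from Direct Track.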