OCR/openspec/changes/archive/2025-11-26-fix-ocr-track-table-data-format/tasks.md
egg 6e050eb540 fix: OCR track table data format and image cropping
Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 18:48:15 +08:00


Tasks: Fix OCR Track Table Data Format

1. Implementation

  • 1.1 Add BeautifulSoup import and dependency check in ocr_to_unified_converter.py
  • 1.2 Rewrite _extract_table_data method to parse HTML using BeautifulSoup
  • 1.3 Extract cell content, row index, column index for each <td> and <th> element
  • 1.4 Handle rowspan and colspan attributes for merged cells
  • 1.5 Create TableCell objects with proper content and positioning
  • 1.6 Populate TableData.cells array with extracted TableCell objects
  • 1.7 Preserve header detection (<th> elements) and store in TableData.headers
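
Tasks 1.3 to 1.7 can be sketched as a grid-placement pass over cells that have already been pulled out of the HTML. This is a minimal sketch, not the project's implementation: the `TableCell` and `TableData` dataclasses below are hypothetical stand-ins for the real models, and `build_table_data` is an illustrative name. The input tuples are assumed to come from the BeautifulSoup step (for example, from iterating `tr.find_all(["td", "th"])` and reading the `rowspan` and `colspan` attributes).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


# Hypothetical stand-ins for the project's TableCell / TableData models.
@dataclass
class TableCell:
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: List[str] = field(default_factory=list)


# One parsed cell: (text, rowspan, colspan, is_header).
RawCell = Tuple[str, int, int, bool]


def build_table_data(raw_rows: List[List[RawCell]]) -> TableData:
    """Assign row/column indices to cells, skipping slots covered by spans."""
    occupied: Set[Tuple[int, int]] = set()  # slots taken by earlier cells
    cells: List[TableCell] = []
    headers: List[str] = []
    max_row = 0
    max_col = 0

    for r, raw_row in enumerate(raw_rows):
        c = 0
        for text, rowspan, colspan, is_header in raw_row:
            # Skip columns already claimed by a rowspan from a previous row.
            while (r, c) in occupied:
                c += 1
            rowspan = max(1, rowspan)
            colspan = max(1, colspan)
            # Mark every slot this cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
            cells.append(TableCell(r, c, text, rowspan, colspan))
            if is_header:
                headers.append(text)
            max_row = max(max_row, r + rowspan)
            max_col = max(max_col, c + colspan)
            c += colspan

    return TableData(
        rows=max(len(raw_rows), max_row),
        cols=max_col,
        cells=cells,
        headers=headers,
    )


# Usage: a header row, then a cell that spans two rows in the first column.
raw = [
    [("Name", 1, 1, True), ("Qty", 1, 1, True)],
    [("A", 2, 1, False), ("1", 1, 1, False)],
    [("2", 1, 1, False)],
]
table = build_table_data(raw)
```

In this example the last cell lands in column 1 rather than column 0, because column 0 of the third row is still covered by the two-row span of "A". Empty cells pass through unchanged as an empty string.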

2. Edge Case Handling

  • 2.1 Handle malformed HTML tables gracefully (missing closing tags, nested tables)
  • 2.2 Handle empty cells (create TableCell with empty string content)
  • 2.3 Handle tables without <tr> structure (fall back to current behavior)
  • 2.4 Log warnings for unparseable tables instead of failing silently
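
The fallback and logging behavior in 2.3 and 2.4 could look roughly like the wrapper below. It is a sketch under stated assumptions: `extract_or_fallback`, `parse`, and `fallback` are hypothetical names, with `parse` standing for the new BeautifulSoup-based extraction and `fallback` for whatever the current behavior is. The logger name is also illustrative.

```python
import logging
from typing import Any, Callable

# Illustrative logger name; the real module may configure logging differently.
logger = logging.getLogger("ocr_to_unified_converter")


def extract_or_fallback(
    html: str,
    parse: Callable[[str], Any],
    fallback: Callable[[str], Any],
) -> Any:
    """Try the structured parser; warn and fall back instead of failing silently."""
    # Tables without any <tr> row keep the existing (fallback) behavior.
    if not html or "<tr" not in html.lower():
        logger.warning("Table HTML has no <tr> rows; using fallback")
        return fallback(html)
    try:
        return parse(html)
    except Exception as exc:  # broad on purpose: any parser error triggers fallback
        logger.warning("Could not parse table HTML (%s); using fallback", exc)
        return fallback(html)


def _failing_parser(html: str) -> Any:
    """Hypothetical parser that always fails, used to show the fallback path."""
    raise ValueError("unexpected structure")


# Usage with hypothetical callables.
ok = extract_or_fallback("<table><tr><td>x</td></tr></table>", lambda h: "parsed", lambda h: "legacy")
no_rows = extract_or_fallback("<table></table>", lambda h: "parsed", lambda h: "legacy")
broken = extract_or_fallback("<table><tr><td>x", _failing_parser, lambda h: "legacy")
```

Catching a broad `Exception` is a deliberate choice in this sketch so that a single bad table cannot abort a whole document; a real implementation may prefer to narrow it.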

3. Testing

  • 3.1 Create unit tests for _extract_table_data with various HTML table formats
  • 3.2 Test simple tables (basic rows/columns)
  • 3.3 Test tables with merged cells (rowspan/colspan)
  • 3.4 Test tables with header rows (<th> elements)
  • 3.5 Test malformed HTML tables (handled via BeautifulSoup's tolerance)
  • 3.6 Integration test: OCR Track PDF generation with tables

4. Verification (Track Isolation)

  • 4.1 Compare OCR Track table output format with Direct Track output format
  • 4.2 Verify PDF Generator renders OCR Track tables correctly
  • 4.3 Direct Track regression test: direct_extraction_engine.py NOT modified (confirmed via git status)
  • 4.4 Hybrid Mode regression test: ocr_service.py NOT modified, image merge logic unchanged
  • 4.5 OCR Track fix verification: unit tests confirm that
    • TableData.cells array is populated (6 cells in 3x2 table)
    • TableCell objects have correct row/col/content values
    • Headers extracted correctly
  • 4.6 Verify DirectExtractionEngine code is NOT modified (isolation check - confirmed)

5. Dependencies

  • 5.1 Add beautifulsoup4>=4.12.0 to requirements.txt
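
The dependency check from 1.1 and the version floor in 5.1 can be verified at runtime with the standard library. The sketch below is one possible approach, not the code that was committed; `parse_version`, `installed_version`, and `meets_minimum` are hypothetical helper names, and the version parsing is deliberately simple (leading digits of each dotted part only).

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '4.12.0' into (4, 12, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    # Pad so that '4.12' and '4.12.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version string for a distribution, or None."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(dist: str, minimum: str) -> bool:
    """True when the distribution is installed at or above the minimum version."""
    found = installed_version(dist)
    return found is not None and parse_version(found) >= parse_version(minimum)


# Usage: check the requirement named in 5.1 before importing bs4.
bs4_ready = meets_minimum("beautifulsoup4", "4.12.0")
```

If `bs4_ready` is false, the converter could log a warning and keep the previous table handling rather than raising at import time.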