OCR/spec.md at a07aad96b33e6728aba0ba60e7693b13067f4bf6

egg 6e050eb540 fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 18:48:15 +08:00

ADDED Requirements

Requirement: OCR Track Table Data Structure Consistency

The OCR Track SHALL produce TableData objects with fully populated cells arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.

Scenario: OCR Track produces structured TableData for HTML tables

GIVEN a document with tables is processed via OCR Track
WHEN PP-StructureV3 returns HTML table content in the html or content field
THEN the ocr_to_unified_converter SHALL parse the HTML and produce a TableData object
AND the TableData.cells array SHALL contain TableCell objects for each cell
AND each TableCell SHALL have correct row, col, and content values
AND the output format SHALL match Direct Track's TableData structure
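
The conversion this scenario describes could be sketched roughly as follows. This is a minimal illustration, not the converter's actual code: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins for the project's models, `html_to_table_data` is an invented name, and the standard library's `html.parser` is used in place of BeautifulSoup to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: list = field(default_factory=list)


class _CellCollector(HTMLParser):
    """Collects the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # list of rows, each a list of cell strings
        self._text = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # cell without an enclosing <tr>
                self.rows.append([])
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            self.rows[-1].append("".join(self._text).strip())
            self._text = None


def html_to_table_data(html: str) -> TableData:
    """Turn an HTML table string into a TableData with one TableCell per cell."""
    collector = _CellCollector()
    collector.feed(html)
    collector.close()
    cells = [
        TableCell(row=r, col=c, content=text)
        for r, row in enumerate(collector.rows)
        for c, text in enumerate(row)
    ]
    cols = max((len(row) for row in collector.rows), default=0)
    return TableData(rows=len(collector.rows), cols=cols, cells=cells)


table = html_to_table_data(
    "<table><tr><td>Item</td><td>Qty</td></tr>"
    "<tr><td>Bolt</td><td>4</td></tr></table>"
)
print([(c.row, c.col, c.content) for c in table.cells])
```

The sketch ignores spans and header detection, which the following scenarios cover.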

Scenario: OCR Track handles tables with merged cells

GIVEN an HTML table with rowspan or colspan attributes
WHEN the table is converted to TableData
THEN each TableCell SHALL have correct row_span and col_span values
AND the cell content SHALL be correctly extracted

Scenario: OCR Track handles header rows

GIVEN an HTML table with <th> elements or a header row
WHEN the table is converted to TableData
THEN the TableData.headers field SHALL contain the header cell contents
AND header cells SHALL also be included in the cells array
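
A rough sketch of header collection, under the assumption that only `<th>` elements mark header cells (detecting a header row made of plain `<td>` cells is left out). `TableData` here is a simplified hypothetical model that stores cells as `(row, col, content)` tuples, and the parser class name is invented for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class TableData:  # simplified hypothetical model
    headers: list = field(default_factory=list)
    cells: list = field(default_factory=list)  # (row, col, content) tuples


class HeaderAwareParser(HTMLParser):
    """Records every cell and additionally copies <th> text into headers."""

    def __init__(self):
        super().__init__()
        self.table = TableData()
        self._row = -1    # index of the current <tr>; -1 before the first row
        self._col = 0
        self._tag = None  # "td" or "th" while inside a cell
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th") and self._row >= 0:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and tag == self._tag:
            content = "".join(self._text).strip()
            # Header cells go into cells as well, as the scenario requires.
            self.table.cells.append((self._row, self._col, content))
            if tag == "th":
                self.table.headers.append(content)
            self._col += 1
            self._tag = None


parser = HeaderAwareParser()
parser.feed(
    "<table><tr><th>Name</th><th>Age</th></tr>"
    "<tr><td>Ann</td><td>30</td></tr></table>"
)
parser.close()
print(parser.table.headers)
print(parser.table.cells)
```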

Scenario: OCR Track gracefully handles malformed HTML tables

GIVEN an HTML table with malformed markup (missing closing tags, invalid nesting)
WHEN parsing is attempted
THEN the system SHALL attempt best-effort parsing using a tolerant HTML parser
AND if parsing fails completely, the system SHALL fall back to returning a basic TableData containing only row/col counts
AND the system SHALL log a warning for debugging purposes
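
The fallback path might look something like the sketch below. The spec does not say how row and column counts are obtained once parsing has failed, so counting opening `<tr>` and `<td>`/`<th>` tags with regular expressions is an assumption made for illustration. The logger name, `convert_table`, `fallback_table_data`, and the simplified `TableData` are likewise hypothetical.

```python
import logging
import re
from dataclasses import dataclass, field

logger = logging.getLogger("ocr_to_unified_converter")  # hypothetical name


@dataclass
class TableData:  # simplified hypothetical model
    rows: int
    cols: int
    cells: list = field(default_factory=list)


def fallback_table_data(html, error):
    """Build a counts-only TableData from opening tags and log a warning."""
    # Everything after each opening <tr is treated as one row chunk.
    row_chunks = re.split(r"<tr\b", html, flags=re.IGNORECASE)[1:]
    cols = max(
        (len(re.findall(r"<t[dh]\b", chunk, flags=re.IGNORECASE))
         for chunk in row_chunks),
        default=0,
    )
    logger.warning(
        "Table HTML could not be parsed (%s); returning counts only", error
    )
    return TableData(rows=len(row_chunks), cols=cols)


def convert_table(html, parse):
    """Try the structured parser first; fall back if it fails or finds nothing.

    `parse` is any callable that takes HTML and returns a TableData.
    """
    try:
        table = parse(html)
        if table.cells:
            return table
        error = "no cells found"
    except Exception as exc:  # broad on purpose: any parser failure falls back
        error = exc
    return fallback_table_data(html, error)


def broken_parser(html):
    """Stand-in for a structured parser that fails on this input."""
    raise ValueError("unexpected markup")


result = convert_table("<table><tr><td>a<td>b<tr><td>c</table>", broken_parser)
print(result.rows, result.cols, result.cells)
```

Passing the parser in as a callable keeps the fallback logic independent of whichever tolerant HTML parser is actually used.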

Scenario: PDF Generator renders OCR Track tables correctly

GIVEN a UnifiedDocument from OCR Track containing table elements
WHEN the PDF Generator processes the document
THEN tables SHALL be rendered as formatted tables (not as raw HTML text)
AND the rendering SHALL be identical to Direct Track table rendering

Scenario: Direct Track table processing remains unchanged

GIVEN a native PDF with embedded tables
WHEN the document is processed via Direct Track
THEN the DirectExtractionEngine SHALL continue to produce TableData objects as before
AND the ocr_to_unified_converter.py changes SHALL NOT affect Direct Track processing
AND table rendering in PDF output SHALL be identical to pre-fix behavior

Scenario: Hybrid Mode table source isolation

GIVEN a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
WHEN the system merges OCR Track results into Direct Track results
THEN only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
AND table elements SHALL exclusively come from Direct Track
AND no OCR Track table data SHALL contaminate the final output
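
The isolation rule can be expressed as an allow-list filter applied during the merge. The sketch below is illustrative only: the `ElementType` members beyond FIGURE, IMAGE, LOGO, TABLE, and TEXT are not implied, the `Element` model and its `source` field are hypothetical, and `merge_hybrid` is an invented name rather than the project's merge function.

```python
from dataclasses import dataclass
from enum import Enum


class ElementType(Enum):  # hypothetical subset of the project's enum
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:  # hypothetical minimal element model
    type: ElementType
    source: str  # "direct" or "ocr", used here only to make the result visible
    content: str = ""


# Only these element types may cross over from OCR Track in Hybrid Mode.
MERGEABLE_FROM_OCR = frozenset(
    {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
)


def merge_hybrid(direct_elements, ocr_elements):
    """Keep all Direct Track elements and add only image-like OCR elements."""
    merged = list(direct_elements)
    merged.extend(e for e in ocr_elements if e.type in MERGEABLE_FROM_OCR)
    return merged


direct = [
    Element(ElementType.TEXT, "direct", "Intro"),
    Element(ElementType.TABLE, "direct", "native table"),
]
ocr = [
    Element(ElementType.TABLE, "ocr", "<table>...</table>"),
    Element(ElementType.IMAGE, "ocr", "img_001.png"),
    Element(ElementType.TEXT, "ocr", "duplicate text"),
]

merged = merge_hybrid(direct, ocr)
print([(e.type.name, e.source) for e in merged])
```

Using an allow-list rather than excluding TABLE explicitly means any element type added later stays out of the merge until it is deliberately permitted.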