fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Match the Direct Track image-saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-26 18:48:15 +08:00
parent a227311b2d
commit 6e050eb540
8 changed files with 585 additions and 30 deletions


@@ -0,0 +1,51 @@
## ADDED Requirements
### Requirement: OCR Track Table Data Structure Consistency
The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.
#### Scenario: OCR Track produces structured TableData for HTML tables
- **GIVEN** a document with tables is processed via OCR Track
- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field
- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
- **AND** the `TableData.cells` array SHALL contain `TableCell` objects for each cell
- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values
- **AND** the output format SHALL match Direct Track's `TableData` structure
#### Scenario: OCR Track handles tables with merged cells
- **GIVEN** an HTML table with `rowspan` or `colspan` attributes
- **WHEN** the table is converted to `TableData`
- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values
- **AND** the cell content SHALL be correctly extracted
#### Scenario: OCR Track handles header rows
- **GIVEN** an HTML table with `<th>` elements or a header row
- **WHEN** the table is converted to `TableData`
- **THEN** the `TableData.headers` field SHALL contain the header cell contents
- **AND** header cells SHALL also be included in the `cells` array
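The three conversion scenarios above (cell population, merged cells, header rows) can be sketched as follows. This is a minimal illustration, not the converter's actual code: the `TableCell` and `TableData` dataclasses are hypothetical stand-ins for the project's models, `html_to_table_data` is an invented name, and the standard-library `html.parser` is swapped in for BeautifulSoup so the example has no third-party dependency. Nested tables are not handled.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:  # hypothetical stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class TableData:  # hypothetical stand-in for the project's TableData
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None


def _span(value) -> int:
    """Read a rowspan/colspan attribute, defaulting to 1 on bad input."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class _TableHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells: List[TableCell] = []
        self.headers: List[str] = []
        self.occupied = set()  # (row, col) grid slots claimed so far
        self._row = -1
        self._current: Optional[TableCell] = None
        self._is_header = False
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.close_cell()  # tolerate a missing </td> before a new row
            self._row += 1
        elif tag in ("td", "th"):
            self.close_cell()  # tolerate a missing </td> before a new cell
            self._row = max(self._row, 0)  # tolerate a cell with no <tr>
            attr = dict(attrs)
            row_span = _span(attr.get("rowspan"))
            col_span = _span(attr.get("colspan"))
            # Skip slots already claimed by spans from earlier cells.
            col = 0
            while (self._row, col) in self.occupied:
                col += 1
            for r in range(self._row, self._row + row_span):
                for c in range(col, col + col_span):
                    self.occupied.add((r, c))
            self._current = TableCell(self._row, col, "", row_span, col_span)
            self._is_header = tag == "th"
            self._text = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self.close_cell()

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def close_cell(self):
        if self._current is None:
            return
        # Collapse internal whitespace so content is a single clean string.
        self._current.content = " ".join("".join(self._text).split())
        self.cells.append(self._current)
        if self._is_header:
            self.headers.append(self._current.content)
        self._current = None


def html_to_table_data(html: str) -> TableData:
    parser = _TableHTMLParser()
    parser.feed(html)
    parser.close()       # flush any buffered trailing text
    parser.close_cell()  # finish a cell left open by truncated markup
    rows = max((r for r, _ in parser.occupied), default=-1) + 1
    cols = max((c for _, c in parser.occupied), default=-1) + 1
    return TableData(rows, cols, parser.cells, parser.headers or None)
```

The occupancy set is what keeps `col` correct under merged cells: a cell that follows a `rowspan` from the previous row is placed in the first free column rather than column 0. Header cells are appended to both `headers` and `cells`, as the header-row scenario requires.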
#### Scenario: OCR Track gracefully handles malformed HTML tables
- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting)
- **WHEN** parsing is attempted
- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser
- **AND** if parsing fails completely, the system SHALL fall back to returning a basic `TableData` with row and column counts only
- **AND** the system SHALL log a warning for debugging purposes
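One way to read the fallback requirement is sketched below. It assumes a hypothetical `parse` callable (any function that turns HTML into a `TableData`) and a simplified stand-in `TableData`; the regex-based counting is an illustrative choice, not necessarily how the converter derives its counts.

```python
import logging
import re
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger(__name__)


@dataclass
class TableData:  # hypothetical, simplified stand-in for the project's model
    rows: int
    cols: int
    cells: List[object] = field(default_factory=list)


def table_data_with_fallback(
    html: str, parse: Callable[[str], TableData]
) -> TableData:
    """Try the structured parser first; on failure return counts only."""
    try:
        table = parse(html)
        if table.cells:
            return table
        reason = "parser returned no cells"
    except Exception as exc:  # any parser failure triggers the fallback
        reason = f"parser raised {exc!r}"

    # Rough counts: split on row openers, then count cell openers per row.
    row_chunks = re.split(r"<tr\b", html, flags=re.IGNORECASE)[1:]
    rows = len(row_chunks)
    cols = max(
        (len(re.findall(r"<t[dh]\b", chunk, flags=re.IGNORECASE))
         for chunk in row_chunks),
        default=0,
    )
    logger.warning(
        "HTML table parsing failed (%s); falling back to %dx%d counts",
        reason, rows, cols,
    )
    return TableData(rows=rows, cols=cols, cells=[])
```

The counts ignore `rowspan` and `colspan`, so they are only an approximation of the table's grid size; the empty `cells` list signals to downstream code that the structure could not be recovered.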
#### Scenario: PDF Generator renders OCR Track tables correctly
- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements
- **WHEN** the PDF Generator processes the document
- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text)
- **AND** the rendering SHALL be identical to Direct Track table rendering
#### Scenario: Direct Track table processing remains unchanged
- **GIVEN** a native PDF with embedded tables
- **WHEN** the document is processed via Direct Track
- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior
#### Scenario: Hybrid Mode table source isolation
- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
- **WHEN** the system merges OCR Track results into Direct Track results
- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
- **AND** table elements SHALL exclusively come from Direct Track
- **AND** no OCR Track table data SHALL contaminate the final output
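The isolation rule in this last scenario amounts to an allow-list filter at merge time. The sketch below uses hypothetical `ElementType`, `Element`, and `merge_hybrid` names to show the idea; the project's real merge logic and element model may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ElementType(Enum):  # hypothetical subset of the project's enum
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:  # hypothetical, minimal element model
    type: ElementType
    source: str  # "direct" or "ocr", kept only for illustration


# Only these element types may cross over from OCR Track in Hybrid Mode.
MERGEABLE_FROM_OCR = frozenset(
    {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}
)


def merge_hybrid(direct: List[Element], ocr: List[Element]) -> List[Element]:
    """Keep every Direct Track element; add only image-like OCR elements."""
    return list(direct) + [e for e in ocr if e.type in MERGEABLE_FROM_OCR]
```

Because the filter is an allow-list rather than a deny-list, any element type added to OCR Track later stays out of Hybrid Mode output until it is explicitly permitted.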