fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,51 @@
## ADDED Requirements

### Requirement: OCR Track Table Data Structure Consistency
The OCR Track SHALL produce `TableData` objects with fully populated `cells` arrays that match the format produced by Direct Track, ensuring consistent table rendering across both processing tracks.

#### Scenario: OCR Track produces structured TableData for HTML tables
- **GIVEN** a document with tables is processed via OCR Track
- **WHEN** PP-StructureV3 returns HTML table content in the `html` or `content` field
- **THEN** the `ocr_to_unified_converter` SHALL parse the HTML and produce a `TableData` object
- **AND** the `TableData.cells` array SHALL contain `TableCell` objects for each cell
- **AND** each `TableCell` SHALL have correct `row`, `col`, and `content` values
- **AND** the output format SHALL match Direct Track's `TableData` structure
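
The scenario above can be illustrated with a minimal sketch. The commit itself uses BeautifulSoup; this example swaps in the standard library's `html.parser` so it runs without extra dependencies. The `TableCell` and `TableData` classes are simplified stand-ins, and the `rows`/`cols` fields and the `html_to_table_data` name are assumptions, not the project's actual definitions.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional


@dataclass
class TableCell:  # simplified stand-in for the project's TableCell
    row: int
    col: int
    content: str


@dataclass
class TableData:  # simplified stand-in; `rows`/`cols` are assumed field names
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)


class _CellCollector(HTMLParser):
    """Collects the text of every <td>/<th>, tracking its row and column."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[TableCell] = []
        self._row = -1
        self._col = 0
        self._text: Optional[List[str]] = None  # set while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            if self._row < 0:  # cell appeared before any <tr>
                self._row = 0
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._text is not None:
            content = "".join(self._text).strip()
            self.cells.append(TableCell(self._row, self._col, content))
            self._col += 1
            self._text = None


def html_to_table_data(html: str) -> TableData:
    """Hypothetical helper: parse table HTML into a populated TableData."""
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    rows = max((c.row for c in parser.cells), default=-1) + 1
    cols = max((c.col for c in parser.cells), default=-1) + 1
    return TableData(rows=rows, cols=cols, cells=parser.cells)
```

This sketch ignores `rowspan` and `colspan`, so column indices are only correct for tables without merged cells.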

#### Scenario: OCR Track handles tables with merged cells
- **GIVEN** an HTML table with `rowspan` or `colspan` attributes
- **WHEN** the table is converted to `TableData`
- **THEN** each `TableCell` SHALL have correct `row_span` and `col_span` values
- **AND** the cell content SHALL be correctly extracted
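
One common way to assign grid positions to merged cells is an occupancy set. The sketch below assumes the cells have already been read from the HTML in document order as `(content, rowspan, colspan)` tuples. The `place_cells` name and the `RawCell` alias are hypothetical, and `TableCell` is a simplified stand-in for the project's class.

```python
from dataclasses import dataclass
from typing import List, Sequence, Set, Tuple


@dataclass
class TableCell:  # simplified stand-in for the project's TableCell
    row: int
    col: int
    content: str
    row_span: int = 1
    col_span: int = 1


RawCell = Tuple[str, int, int]  # (content, rowspan, colspan), hypothetical shape


def place_cells(raw_rows: Sequence[Sequence[RawCell]]) -> List[TableCell]:
    """Assign row/col positions, skipping slots covered by earlier spans."""
    occupied: Set[Tuple[int, int]] = set()
    cells: List[TableCell] = []
    for r, raw_row in enumerate(raw_rows):
        c = 0
        for content, row_span, col_span in raw_row:
            # Advance past slots already claimed by a span from above or left.
            while (r, c) in occupied:
                c += 1
            cells.append(TableCell(r, c, content, row_span, col_span))
            # Mark every slot this cell covers as taken.
            for dr in range(row_span):
                for dc in range(col_span):
                    occupied.add((r + dr, c + dc))
            c += col_span
    return cells
```

For example, if the first row holds a cell with `rowspan=2` in column 0, the first cell of the second row is placed in column 1 rather than column 0.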

#### Scenario: OCR Track handles header rows
- **GIVEN** an HTML table with `<th>` elements or a header row
- **WHEN** the table is converted to `TableData`
- **THEN** the `TableData.headers` field SHALL contain the header cell contents
- **AND** header cells SHALL also be included in the `cells` array
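
A rough sketch of the header handling follows, again using the standard library's `html.parser` rather than BeautifulSoup. It treats a cell as a header if it is a `<th>` or sits inside `<thead>`; that reading of "a header row" is an assumption, as is the `split_headers` helper name.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class _HeaderAwareParser(HTMLParser):
    """Records each cell's text plus whether it counts as a header cell."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Tuple[str, bool]] = []  # (content, is_header)
        self._in_thead = False
        self._is_header = False
        self._text: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag in ("td", "th"):
            self._text = []
            # Assumption: <th> cells and any cell inside <thead> are headers.
            self._is_header = tag == "th" or self._in_thead

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("td", "th") and self._text is not None:
            self.cells.append(("".join(self._text).strip(), self._is_header))
            self._text = None


def split_headers(html: str) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (headers, all cell contents)."""
    parser = _HeaderAwareParser()
    parser.feed(html)
    parser.close()
    headers = [content for content, is_header in parser.cells if is_header]
    all_cells = [content for content, _ in parser.cells]
    return headers, all_cells
```

Header contents appear in both returned lists, mirroring the requirement that header cells populate `headers` and remain in `cells`.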

#### Scenario: OCR Track gracefully handles malformed HTML tables
- **GIVEN** an HTML table with malformed markup (missing closing tags, invalid nesting)
- **WHEN** parsing is attempted
- **THEN** the system SHALL attempt best-effort parsing using a tolerant HTML parser
- **AND** if parsing fails completely, SHALL fall back to returning basic `TableData` with row/col counts
- **AND** SHALL log a warning for debugging purposes
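
The fallback path could look roughly like the sketch below. The `convert_table` wrapper, the regex-based counting, and the `rows`/`cols` field names are all assumptions made for illustration; the commit does not show how the converter derives its fallback counts.

```python
import logging
import re
from dataclasses import dataclass, field
from typing import Callable, List

logger = logging.getLogger(__name__)


@dataclass
class TableData:  # simplified stand-in; `rows`/`cols` are assumed field names
    rows: int
    cols: int
    cells: List[object] = field(default_factory=list)


def convert_table(html: str, parse: Callable[[str], TableData]) -> TableData:
    """Try the full parser; on failure return counts only and log a warning."""
    try:
        return parse(html)
    except Exception as exc:  # broad on purpose: any parser failure falls back
        logger.warning("Table HTML parsing failed, using fallback: %s", exc)
        # Rough estimate: one chunk per <tr>, widest chunk gives the column count.
        row_chunks = re.split(r"<tr\b", html, flags=re.IGNORECASE)[1:]
        cols = max(
            (len(re.findall(r"<t[dh]\b", chunk, flags=re.IGNORECASE))
             for chunk in row_chunks),
            default=0,
        )
        return TableData(rows=len(row_chunks), cols=cols)
```

The fallback leaves `cells` empty, so downstream consumers would need to tolerate a `TableData` that carries dimensions only.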

#### Scenario: PDF Generator renders OCR Track tables correctly
- **GIVEN** a `UnifiedDocument` from OCR Track containing table elements
- **WHEN** the PDF Generator processes the document
- **THEN** tables SHALL be rendered as formatted tables (not as raw HTML text)
- **AND** the rendering SHALL be identical to Direct Track table rendering

#### Scenario: Direct Track table processing remains unchanged
- **GIVEN** a native PDF with embedded tables
- **WHEN** the document is processed via Direct Track
- **THEN** the `DirectExtractionEngine` SHALL continue to produce `TableData` objects as before
- **AND** the `ocr_to_unified_converter.py` changes SHALL NOT affect Direct Track processing
- **AND** table rendering in PDF output SHALL be identical to pre-fix behavior

#### Scenario: Hybrid Mode table source isolation
- **GIVEN** a document processed via Hybrid Mode (Direct Track primary + OCR Track for images)
- **WHEN** the system merges OCR Track results into Direct Track results
- **THEN** only image elements (FIGURE, IMAGE, LOGO) SHALL be merged from OCR Track
- **AND** table elements SHALL exclusively come from Direct Track
- **AND** no OCR Track table data SHALL contaminate the final output
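
The isolation rule amounts to an allow-list filter at merge time, sketched below. The `ElementType` member values, the `Element` class, and the `merge_hybrid` function are illustrative assumptions; only the FIGURE, IMAGE, and LOGO names come from the scenario itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ElementType(Enum):  # member values are assumed for this sketch
    TEXT = "text"
    TABLE = "table"
    FIGURE = "figure"
    IMAGE = "image"
    LOGO = "logo"


@dataclass
class Element:  # hypothetical minimal element; `source` records the track
    type: ElementType
    source: str


# Only these element types may cross over from OCR Track in Hybrid Mode.
MERGEABLE_FROM_OCR = {ElementType.FIGURE, ElementType.IMAGE, ElementType.LOGO}


def merge_hybrid(direct: List[Element], ocr: List[Element]) -> List[Element]:
    """Keep all Direct Track elements; add only image-like OCR elements."""
    return list(direct) + [e for e in ocr if e.type in MERGEABLE_FROM_OCR]
```

Because tables are absent from the allow-list, any `TABLE` element produced by OCR Track is dropped before it reaches the merged output.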