wip: add TableData.from_dict() for OCR track table parsing (incomplete)

Add TableData.from_dict() and TableCell.from_dict() methods to convert JSON table dicts to proper TableData objects during UnifiedDocument parsing. Modified _json_to_document_element() to detect TABLE elements with dict content containing 'cells' key and convert to TableData.

Note: This fix ensures table elements have proper to_html() method available but the rendered output still needs investigation - tables may still render incorrectly in OCR track PDFs.

Files changed:
- unified_document.py: Add from_dict() class methods
- pdf_generator_service.py: Convert table dicts during JSON parsing
- Add fix-ocr-track-table-rendering proposal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 19:16:51 +08:00
parent 6e050eb540
commit c65df754cf
5 changed files with 281 additions and 1 deletions
--- a/openspec/changes/fix-ocr-track-table-rendering/tasks.md
+++ b/openspec/changes/fix-ocr-track-table-rendering/tasks.md
@@ -0,0 +1,55 @@
+# Implementation Tasks
+
+## Phase 1: Core Fix - Table Content Conversion
+
+### 1.1 Add TableData.from_dict() class method
+- [ ] In `unified_document.py`, add `from_dict()` method to `TableData` class
+- [ ] Handle conversion of cells list (list of dicts) to `TableCell` objects
+- [ ] Preserve rows, cols, headers, caption fields
+
+### 1.2 Fix _json_to_document_element for TABLE elements
+- [ ] In `pdf_generator_service.py`, modify `_json_to_document_element`
+- [ ] When `elem_type == ElementType.TABLE` and content is dict with 'cells', convert to `TableData`
+- [ ] Use `TableData.from_dict()` for clean conversion
+
+### 1.3 Verify TableData.to_html() generates correct HTML
+- [ ] Test that `to_html()` produces parseable HTML with proper row/cell structure
+- [ ] Verify colspan/rowspan attributes are correctly generated
+- [ ] Ensure empty cells are properly handled
+
+## Phase 2: OCR Track Rendering Consistency
+
+### 2.1 Review convert_unified_document_to_ocr_data
+- [ ] Verify TableData objects are properly converted to HTML
+- [ ] Add fallback handling for dict content with 'cells' key
+- [ ] Log warning if content cannot be converted to HTML
+
+### 2.2 Review draw_table_region
+- [ ] Verify HTMLTableParser correctly parses generated HTML
+- [ ] Check that ReportLab Table is positioned at correct bbox
+- [ ] Verify font and style application
+
+## Phase 3: Testing and Verification
+
+### 3.1 Test OCR Track
+- [ ] Test scan.pdf - verify tables have correct structure
+- [ ] Test img1.png, img2.png, img3.png
+- [ ] Compare generated PDF with original documents
+
+### 3.2 Test Direct Track (Regression)
+- [ ] Test PDF files with Direct track
+- [ ] Verify table rendering unchanged
+
+### 3.3 Test Hybrid Mode
+- [ ] Test files that trigger hybrid processing
+- [ ] Verify mixed Direct + OCR elements render correctly
+
+## Phase 4: Code Quality
+
+### 4.1 Add logging
+- [ ] Add debug logging for table content type detection
+- [ ] Log conversion steps for troubleshooting
+
+### 4.2 Error handling
+- [ ] Handle malformed cell data gracefully
+- [ ] Log warnings for unexpected content formats
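
For orientation, tasks 1.1 and 1.2 could look roughly like the sketch below. It is not the code from this commit: the `TableCell` field names (`row`, `col`, `content`, `row_span`, `col_span`) and the helper `coerce_table_content` are hypothetical names chosen for illustration, and only `rows`, `cols`, `cells`, `headers`, and `caption` come from the task list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TableCell:
    # Field names here are assumptions, not the real unified_document.py schema.
    row: int
    col: int
    content: str = ""
    row_span: int = 1
    col_span: int = 1

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableCell":
        # Missing keys fall back to defaults so sparse cell dicts still load.
        return cls(
            row=int(data.get("row", 0)),
            col=int(data.get("col", 0)),
            content=str(data.get("content", "")),
            row_span=int(data.get("row_span", 1)),
            col_span=int(data.get("col_span", 1)),
        )


@dataclass
class TableData:
    rows: int
    cols: int
    cells: List[TableCell] = field(default_factory=list)
    headers: Optional[List[str]] = None
    caption: Optional[str] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TableData":
        # Convert each cell dict to a TableCell; skip entries that are not dicts.
        cells = [
            TableCell.from_dict(c)
            for c in data.get("cells", [])
            if isinstance(c, dict)
        ]
        return cls(
            rows=int(data.get("rows", 0)),
            cols=int(data.get("cols", 0)),
            cells=cells,
            headers=data.get("headers"),
            caption=data.get("caption"),
        )


def coerce_table_content(content: Any) -> Any:
    """Hypothetical helper: convert dict content with a 'cells' key to TableData.

    Any other content (for example an HTML string) is returned unchanged.
    """
    if isinstance(content, dict) and "cells" in content:
        return TableData.from_dict(content)
    return content
```

Passing non-dict content through unchanged is one way to leave existing content untouched while OCR track table dicts gain a typed representation. Whether that matches the real `_json_to_document_element` behavior is not shown in this diff.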