fix: improve PDF layout generation for Direct track

Key fixes:
- Skip large vector_graphics charts (>50% page coverage) that cover text
- Fix font fallback to use NotoSansSC for CJK support instead of Helvetica
- Improve translated table rendering with dynamic font sizing
- Add merged cell (row_span/col_span) support for reflow tables
- Skip text elements inside table bboxes to avoid duplication

Archive openspec proposal: fix-pdf-table-rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 14:55:00 +08:00
parent 08adf3d01d
commit 1b5c7f39a8
5 changed files with 405 additions and 111 deletions
--- a/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md
+++ b/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md
@@ -0,0 +1,68 @@
+# Design: Fix PDF Table Rendering
+
+## Context
+OCR track produces tables with:
+- `cell_boxes`: Accurate pixel coordinates for each cell border
+- `cells`: Content with row/col indices and row_span/col_span
+- `embedded_images`: Images within table cells
+
+Current implementations fail to use these correctly:
+- **Reflow PDF**: Ignores merged cells, misaligns content
+- **Translated Layout PDF**: Creates new Table object instead of using cell_boxes
+
+## Goals / Non-Goals
+
+**Goals:**
+- Translated Layout PDF tables match untranslated Layout PDF quality
+- Reflow PDF tables are readable and correctly structured
+- Embedded images appear in both formats
+
+**Non-Goals:**
+- Perfect pixel-level replication of original table styling
+- Support for complex nested tables
+
+## Decisions
+
+### Decision 1: Translated Layout PDF uses Layered Rendering
+**What**: Draw cell borders using `cell_boxes`, then render translated text in each cell separately
+**Why**: This matches the working approach in `_draw_table_with_cell_boxes()` for untranslated PDFs
+
+```python
+# Step 1: Draw borders using cell_boxes (assumed [x0, y0, x1, y1], already in PDF coordinates)
+for x0, y0, x1, y1 in cell_boxes:
+    pdf_canvas.rect(x0, y0, x1 - x0, y1 - y0)
+
+# Step 2: Render text for each cell
+for cell in cells:
+    cell_bbox = find_matching_cell_box(cell, cell_boxes)
+    draw_text_in_bbox(translated_content, cell_bbox)
+```
+
+### Decision 2: Reflow PDF uses ReportLab SPAN for merged cells
+**What**: Apply `('SPAN', (col1, row1), (col2, row2))` style for merged cells
+**Why**: ReportLab's Table natively supports merged cells via TableStyle
+
+```python
+spans = []  # SPAN commands built from cell data
+for cell in cells:
+    if cell.row_span > 1 or cell.col_span > 1:
+        spans.append(('SPAN',
+            (cell.col, cell.row),
+            (cell.col + cell.col_span - 1, cell.row + cell.row_span - 1)))
+```
+
+### Decision 3: Column widths from cell_boxes ratio
+**What**: Calculate column widths proportionally from cell_boxes
+**Why**: Preserves original table structure in reflow mode
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Text overflow in translated cells | Shrink font (min 8pt) or truncate with ellipsis |
+| cell_boxes count does not match cells count | Fall back to equal-width columns |
+| Complex merged cell patterns | Handle simple spans, skip complex patterns |
+
+## Open Questions
+- Should reflow PDF preserve exact column width ratios or allow ReportLab auto-sizing?
+- How to handle cells with both text and images?
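
Decisions 2 and 3 in the design above can be sketched together in plain Python. This is a minimal illustration, not the code from this commit: the function names `build_span_commands` and `column_widths_from_boxes` are hypothetical, cells are assumed to be dicts with `row`, `col`, `row_span`, and `col_span` keys (the design doc uses attribute access), and each entry of `cell_boxes` is assumed to be an `[x0, y0, x1, y1]` box for the cell at the same index. The equal-width fallback mirrors the mitigation listed in the risk table.

```python
from typing import Dict, List, Sequence, Tuple

SpanCommand = Tuple[str, Tuple[int, int], Tuple[int, int]]


def build_span_commands(cells: Sequence[Dict[str, int]]) -> List[SpanCommand]:
    """Return one ('SPAN', (col, row), (end_col, end_row)) entry per merged cell."""
    spans: List[SpanCommand] = []
    for cell in cells:
        row_span = cell.get("row_span", 1)
        col_span = cell.get("col_span", 1)
        if row_span > 1 or col_span > 1:
            spans.append((
                "SPAN",
                (cell["col"], cell["row"]),
                (cell["col"] + col_span - 1, cell["row"] + row_span - 1),
            ))
    return spans


def column_widths_from_boxes(
    cell_boxes: Sequence[Sequence[float]],
    cells: Sequence[Dict[str, int]],
    total_width: float,
) -> List[float]:
    """Scale column widths to total_width, keeping the ratios seen in cell_boxes.

    Assumption: cell_boxes[i] is the [x0, y0, x1, y1] box of cells[i].
    Only single-column cells are measured. Equal widths are returned when
    the counts differ or when some column has no single-column cell.
    """
    if not cells:
        return []
    num_cols = max(c["col"] + c.get("col_span", 1) for c in cells)
    equal = [total_width / num_cols] * num_cols
    if len(cell_boxes) != len(cells):
        return equal

    measured: Dict[int, float] = {}
    for cell, (x0, _y0, x1, _y1) in zip(cells, cell_boxes):
        if cell.get("col_span", 1) == 1:
            col = cell["col"]
            measured[col] = max(measured.get(col, 0.0), x1 - x0)

    total_measured = sum(measured.values())
    if len(measured) != num_cols or total_measured <= 0:
        return equal
    scale = total_width / total_measured
    return [measured[col] * scale for col in range(num_cols)]


# Example input: a header cell spanning two columns above two body cells.
sample_cells = [
    {"row": 0, "col": 0, "col_span": 2},
    {"row": 1, "col": 0},
    {"row": 1, "col": 1},
]
sample_boxes = [[0, 0, 300, 20], [0, 20, 100, 40], [100, 20, 300, 40]]

sample_spans = build_span_commands(sample_cells)
sample_widths = column_widths_from_boxes(sample_boxes, sample_cells, 450)
```

The resulting list of span tuples is in the shape that ReportLab's `TableStyle` accepts, and the width list can be passed as `colWidths` when constructing a `Table`. Whether fixed ratios or ReportLab auto-sizing is preferable remains the open question noted in the design.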