Key fixes:
- Skip large vector_graphics charts (>50% page coverage) that cover text
- Fix font fallback to use NotoSansSC for CJK support instead of Helvetica
- Improve translated table rendering with dynamic font sizing
- Add merged cell (row_span/col_span) support for reflow tables
- Skip text elements inside table bboxes to avoid duplication

Archive openspec proposal: fix-pdf-table-rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Design: Fix PDF Table Rendering
Context
OCR track produces tables with:
- cell_boxes: Accurate pixel coordinates for each cell border
- cells: Content with row/col indices and row_span/col_span
- embedded_images: Images within table cells
Current implementations fail to use these correctly:
- Reflow PDF: Ignores merged cells, misaligns content
- Translated Layout PDF: Creates new Table object instead of using cell_boxes
Goals / Non-Goals
Goals:
- Translated Layout PDF tables match untranslated Layout PDF quality
- Reflow PDF tables are readable and correctly structured
- Embedded images appear in both formats
Non-Goals:
- Perfect pixel-level replication of original table styling
- Support for complex nested tables
Decisions
Decision 1: Translated Layout PDF uses Layered Rendering
What: Draw cell borders using cell_boxes, then render translated text in each cell separately
Why: This matches the working approach in _draw_table_with_cell_boxes() for untranslated PDFs
    # Step 1: Draw borders using cell_boxes
    for cell_box in cell_boxes:
        # x, y, width, height are derived from cell_box (conversion omitted here)
        pdf_canvas.rect(x, y, width, height)

    # Step 2: Render text for each cell
    for cell in cells:
        cell_bbox = find_matching_cell_box(cell, cell_boxes)
        draw_text_in_bbox(translated_content, cell_bbox)
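The two layers can be sketched as a runnable function. This is only an illustration under stated assumptions, none of which come from the design: boxes are `(x0, y0, x1, y1)` tuples with a top-left origin and already scaled to PDF points, `texts[i]` belongs to `cell_boxes[i]`, and `canvas` is any object with ReportLab-style `rect()` and `drawString()` methods. The name `draw_table_layered` is hypothetical.

```python
def draw_table_layered(canvas, cell_boxes, texts, page_height,
                       padding=2.0, font_size=9.0):
    """Draw one border per cell box, then the text that belongs to it.

    Assumptions (illustrative only): boxes are (x0, y0, x1, y1) with a
    top-left origin, and texts[i] is the translated text for cell_boxes[i].
    """
    # Layer 1: borders. PDF coordinates start at the bottom-left corner,
    # so the y axis is flipped using the page height.
    for x0, y0, x1, y1 in cell_boxes:
        canvas.rect(x0, page_height - y1, x1 - x0, y1 - y0)

    # Layer 2: text, placed just inside the top-left corner of each box.
    for (x0, y0, x1, y1), text in zip(cell_boxes, texts):
        baseline_y = page_height - y0 - padding - font_size
        canvas.drawString(x0 + padding, baseline_y, text)
```

Drawing all borders before any text keeps the grid identical to the untranslated output even when a translated string later needs to be shrunk or truncated.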
Decision 2: Reflow PDF uses ReportLab SPAN for merged cells
What: Apply ('SPAN', (col1, row1), (col2, row2)) style for merged cells
Why: ReportLab's Table natively supports merged cells via TableStyle
    # Build span commands from cell data
    spans = []
    for cell in cells:
        if cell.row_span > 1 or cell.col_span > 1:
            spans.append(('SPAN',
                          (cell.col, cell.row),
                          (cell.col + cell.col_span - 1, cell.row + cell.row_span - 1)))
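A self-contained version of the span builder, as a sketch. It assumes each cell is a dict with `row`, `col` and optional `row_span`/`col_span` keys; that shape and the name `build_span_commands` are illustrative, not taken from the codebase.

```python
def build_span_commands(cells):
    """Return ReportLab-style ('SPAN', (col, row), (col, row)) tuples.

    Assumes each cell is a dict with 'row' and 'col' keys and optional
    'row_span' / 'col_span' keys (hypothetical shape; missing span means 1).
    """
    spans = []
    for cell in cells:
        row_span = cell.get("row_span") or 1
        col_span = cell.get("col_span") or 1
        if row_span > 1 or col_span > 1:
            start = (cell["col"], cell["row"])
            end = (cell["col"] + col_span - 1, cell["row"] + row_span - 1)
            spans.append(("SPAN", start, end))
    return spans


cells = [
    {"row": 0, "col": 0, "col_span": 2},  # header spanning two columns
    {"row": 1, "col": 0, "row_span": 2},  # label spanning two rows
    {"row": 1, "col": 1},
    {"row": 2, "col": 1},
]
spans = build_span_commands(cells)
```

The resulting list can be passed to ReportLab's `TableStyle`. The table data still has to be a full rectangular grid, so positions covered by a span need placeholder values such as empty strings.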
Decision 3: Column widths from cell_boxes ratio
What: Calculate column widths proportionally from cell_boxes
Why: Preserves original table structure in reflow mode
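One way the proportional widths could be computed, as a sketch. It assumes boxes are `(x0, y0, x1, y1)` tuples and that column boundaries can be recovered by merging nearby vertical edges; the function name and the `tolerance` parameter are hypothetical. When the edges do not yield the expected number of columns it falls back to equal widths, mirroring the mitigation listed under Risks.

```python
def column_widths_from_cell_boxes(cell_boxes, num_cols, total_width,
                                  tolerance=3.0):
    """Scale column widths to total_width, keeping the ratios in cell_boxes.

    Assumes boxes are (x0, y0, x1, y1). Falls back to equal widths when the
    boxes do not describe exactly num_cols columns.
    """
    if num_cols <= 0:
        return []
    equal = [total_width / num_cols] * num_cols

    # Collect every left and right edge, then merge edges that are within
    # `tolerance` of each other (OCR boxes rarely line up exactly).
    edges = sorted({x for box in cell_boxes for x in (box[0], box[2])})
    boundaries = []
    for x in edges:
        if not boundaries or x - boundaries[-1] > tolerance:
            boundaries.append(x)

    if len(boundaries) != num_cols + 1:
        return equal

    span = boundaries[-1] - boundaries[0]
    return [(right - left) * total_width / span
            for left, right in zip(boundaries, boundaries[1:])]
```

For example, two columns that are 100 px and 200 px wide in the source, reflowed into 150 pt of available width, come out as 50 pt and 100 pt.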
Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| Text overflow in translated cells | Shrink font (min 8pt) or truncate with ellipsis |
| cell_boxes not matching cells count | Fall back to equal-width columns |
| Complex merged cell patterns | Handle simple spans, skip complex patterns |
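
The first mitigation (shrink, then truncate) could look roughly like the sketch below. It handles single-line text only, and `fit_text`, `start_size` and `step` are hypothetical names. The `measure` callable is an assumption too: with ReportLab it could wrap `pdfmetrics.stringWidth(text, font_name, size)`.

```python
def fit_text(text, max_width, measure, start_size=10.0, min_size=8.0,
             step=0.5):
    """Return (text, font_size) such that the text fits within max_width.

    `measure(text, size)` must return the rendered width of `text` at
    `size`; it is supplied by the caller (hypothetical interface).
    """
    # Shrink the font in small steps, but never below min_size.
    size = start_size
    while size > min_size and measure(text, size) > max_width:
        size = max(min_size, size - step)
    if measure(text, size) <= max_width:
        return text, size

    # Still too wide at the minimum size: cut characters from the end and
    # append an ellipsis until the result fits.
    ellipsis = "…"
    for end in range(len(text) - 1, 0, -1):
        candidate = text[:end].rstrip() + ellipsis
        if measure(candidate, size) <= max_width:
            return candidate, size
    return ellipsis, size
```

Shrinking first keeps the full translation visible whenever possible; truncation is the last resort because it loses content.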
Open Questions
- Should reflow PDF preserve exact column width ratios or allow ReportLab auto-sizing?
- How to handle cells with both text and images?