OCR/openspec/changes/archive/2025-12-03-fix-pdf-table-rendering/design.md
egg 1b5c7f39a8 fix: improve PDF layout generation for Direct track
Key fixes:
- Skip large vector_graphics charts (>50% page coverage) that cover text
- Fix font fallback to use NotoSansSC for CJK support instead of Helvetica
- Improve translated table rendering with dynamic font sizing
- Add merged cell (row_span/col_span) support for reflow tables
- Skip text elements inside table bboxes to avoid duplication

Archive openspec proposal: fix-pdf-table-rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 14:55:00 +08:00

Design: Fix PDF Table Rendering

Context

The OCR track produces tables with:

  • cell_boxes: Accurate pixel coordinates for each cell border
  • cells: Content with row/col indices and row_span/col_span
  • embedded_images: Images within table cells

Current implementations fail to use these correctly:

  • Reflow PDF: Ignores merged cells and misaligns content
  • Translated Layout PDF: Creates a new Table object instead of using cell_boxes

Goals / Non-Goals

Goals:

  • Translated Layout PDF tables match untranslated Layout PDF quality
  • Reflow PDF tables are readable and correctly structured
  • Embedded images appear in both formats

Non-Goals:

  • Perfect pixel-level replication of original table styling
  • Support for complex nested tables

Decisions

Decision 1: Translated Layout PDF uses Layered Rendering

What: Draw cell borders using cell_boxes, then render translated text in each cell separately.

Why: This matches the working approach in _draw_table_with_cell_boxes() for untranslated PDFs.

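# Step 1: Draw borders using cell_boxes
for cell_box in cell_boxes:
    # Assumes each cell_box is (x0, y0, x1, y1), already converted to PDF coordinates
    x0, y0, x1, y1 = cell_box
    pdf_canvas.rect(x0, y0, x1 - x0, y1 - y0)

# Step 2: Render text for each cell
for cell in cells:
    cell_bbox = find_matching_cell_box(cell, cell_boxes)
    # translated_content is the translated text for this cell
    draw_text_in_bbox(translated_content, cell_bbox)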
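Step 1 passes x, y, width, height to pdf_canvas.rect(), while cell_boxes are described as pixel coordinates. The following is a minimal sketch of that conversion, assuming each box is (x0, y0, x1, y1) with a top-left origin and that the PDF canvas uses a bottom-left origin. The function name pixel_box_to_pdf_rect and the page_height_px and scale parameters are hypothetical and do not come from the codebase.

```python
def pixel_box_to_pdf_rect(box, page_height_px, scale=1.0):
    """Convert a top-left-origin pixel box to a bottom-left-origin PDF rect.

    box            -- (x0, y0, x1, y1) in image pixels (assumed format)
    page_height_px -- height of the source page image in pixels
    scale          -- PDF points per pixel (assumed uniform in x and y)

    Returns (x, y, width, height) suitable for a rect() style call.
    """
    x0, y0, x1, y1 = box
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    x = x0 * scale
    # Flip the vertical axis: the bottom edge of the box in pixel space
    # (y1) becomes the rect's lower-left y in PDF space.
    y = (page_height_px - y1) * scale
    return x, y, width, height


if __name__ == "__main__":
    print(pixel_box_to_pdf_rect((10, 20, 110, 70), page_height_px=1000))
```

If the cell_boxes are already in PDF space, this step is unnecessary and the boxes can be passed to rect() after computing width and height only.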

Decision 2: Reflow PDF uses ReportLab SPAN for merged cells

What: Apply ('SPAN', (col1, row1), (col2, row2)) style for merged cells.

Why: ReportLab's Table natively supports merged cells via TableStyle.

# Build span commands from cell data
spans = []
for cell in cells:
    if cell.row_span > 1 or cell.col_span > 1:
        # ReportLab addresses cells as (col, row); the end corner is inclusive
        spans.append(('SPAN',
            (cell.col, cell.row),
            (cell.col + cell.col_span - 1, cell.row + cell.row_span - 1)))
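ReportLab takes a spanned cell's content from its top-left cell, and the other covered cells should still be present in the data grid as empty strings. A minimal sketch of building that grid together with the span commands follows. It assumes cells are plain dicts with row, col, row_span, col_span, and content keys (the real objects may use attributes instead), and build_grid_and_spans is a hypothetical helper name.

```python
def build_grid_and_spans(cells, n_rows, n_cols):
    """Return (data, spans) for a ReportLab-style table.

    data  -- n_rows x n_cols grid of strings; covered cells stay ""
    spans -- list of ('SPAN', (start_col, start_row), (end_col, end_row))
    """
    data = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    spans = []
    for cell in cells:
        row, col = cell["row"], cell["col"]
        row_span = cell.get("row_span", 1)
        col_span = cell.get("col_span", 1)
        # Content goes in the top-left cell of the (possibly merged) region.
        data[row][col] = cell.get("content", "")
        if row_span > 1 or col_span > 1:
            # Clamp to the grid so a bad span cannot point outside the table.
            end_col = min(col + col_span - 1, n_cols - 1)
            end_row = min(row + row_span - 1, n_rows - 1)
            spans.append(("SPAN", (col, row), (end_col, end_row)))
    return data, spans


if __name__ == "__main__":
    cells = [
        {"row": 0, "col": 0, "col_span": 2, "content": "Header"},
        {"row": 1, "col": 0, "content": "a"},
        {"row": 1, "col": 1, "content": "b"},
    ]
    data, spans = build_grid_and_spans(cells, n_rows=2, n_cols=2)
    print(data, spans)
    # With ReportLab installed, these would typically be used as:
    #   table = Table(data); table.setStyle(TableStyle(spans))
```

Clamping the end corner is one possible way to keep malformed spans from breaking the table; rejecting such cells outright would be an equally valid choice.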

Decision 3: Column widths from cell_boxes ratio

What: Calculate column widths proportionally from cell_boxes.

Why: Preserves original table structure in reflow mode.
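One way to derive proportional widths is to collect the distinct vertical edges of the cell boxes and scale the gaps between them to the available width. The sketch below assumes boxes are (x0, y0, x1, y1) tuples; the function name, the tolerance parameter for merging near-identical edges, and the equal-width fallback condition are illustrative assumptions rather than the project's actual logic.

```python
def column_widths_from_cell_boxes(cell_boxes, n_cols, available_width,
                                  tolerance=5.0):
    """Return n_cols widths that sum to available_width.

    Widths are proportional to the gaps between distinct vertical edges
    found in cell_boxes. If the edges do not yield exactly n_cols
    columns, fall back to equal-width columns.
    """
    if n_cols <= 0:
        return []
    # Every box contributes its left and right edge.
    edges = sorted({x for box in cell_boxes for x in (box[0], box[2])})
    # Merge edges closer than `tolerance` (detection jitter).
    merged = []
    for x in edges:
        if not merged or x - merged[-1] > tolerance:
            merged.append(x)
    if len(merged) != n_cols + 1:
        return [available_width / n_cols] * n_cols
    total = merged[-1] - merged[0]
    return [available_width * (right - left) / total
            for left, right in zip(merged, merged[1:])]


if __name__ == "__main__":
    boxes = [(0, 0, 100, 20), (100, 0, 400, 20),
             (0, 20, 100, 40), (101, 20, 400, 40)]
    print(column_widths_from_cell_boxes(boxes, n_cols=2, available_width=200))
```

Merged cells do not add new edges, so a table with spans still resolves to the right column count as long as at least one row exposes every column boundary.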

Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Text overflow in translated cells | Shrink font (min 8pt) or truncate with ellipsis |
| cell_boxes count does not match cells count | Fall back to equal-width columns |
| Complex merged cell patterns | Handle simple spans, skip complex patterns |
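The text-overflow mitigation can be sketched as a shrink-then-truncate loop. This is a single-line illustration only (no wrapping), and it takes the width measurement as a callable so it does not depend on a particular PDF library. The function name fit_text, the 10pt starting size, and the 0.5pt step are assumptions; only the 8pt minimum comes from the mitigation listed above.

```python
def fit_text(text, max_width, measure, start_size=10.0, min_size=8.0,
             step=0.5):
    """Return (text, font_size) that fits max_width on one line.

    measure(text, size) must return the rendered width of `text` at
    `size`. The font is shrunk down to min_size first; if the text still
    overflows, it is truncated and an ellipsis is appended.
    """
    size = start_size
    while size > min_size and measure(text, size) > max_width:
        size = max(min_size, size - step)
    if measure(text, size) <= max_width:
        return text, size
    ellipsis = "..."
    truncated = text
    # Drop trailing characters until the text plus ellipsis fits.
    while truncated and measure(truncated + ellipsis, size) > max_width:
        truncated = truncated[:-1]
    return truncated + ellipsis, size


if __name__ == "__main__":
    # Crude stand-in metric: every glyph is half the font size wide.
    def crude(text, size):
        return len(text) * size * 0.5

    print(fit_text("a" * 40, 80, crude))
```

With ReportLab, measure could wrap reportlab.pdfbase.pdfmetrics.stringWidth(text, font_name, size), assuming the chosen font has already been registered.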

Open Questions

  • Should the reflow PDF preserve exact column width ratios or allow ReportLab auto-sizing?
  • How should cells that contain both text and images be handled?