OCR/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md
egg cfe65158a3 feat: enable document orientation detection for scanned PDFs
- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:13:46 +08:00


Proposal: Use cell_boxes as Primary Table Rendering Source

Summary

Modify table PDF rendering to use cell_boxes coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.

Problem Statement

Current Issue

When processing scan.pdf, PP-StructureV3 detected tables with the following characteristics:

Table 7 (Element 7):

  • cell_boxes: 27 cells forming an 11x10 grid (by coordinate clustering)
  • HTML structure: 9 rows with irregular columns [7, 7, 1, 3, 3, 3, 3, 3, 1] due to colspan

This grid mismatch causes:

  1. _compute_table_grid_from_cell_boxes() returns (None, None)
  2. The PDF generator falls back to a ReportLab Table with equal-width columns
  3. The table renders with incorrect column widths, causing visual misalignment
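
For reference, per-row cell counts on the HTML side can be reproduced with a small parser. This is a minimal sketch using only the standard library; RowCellCounter, html_cells_per_row, and the sample markup are hypothetical (the markup is made up, not taken from scan.pdf), and nested tables are not handled.

```python
from html.parser import HTMLParser


class RowCellCounter(HTMLParser):
    """Count <td>/<th> elements in each <tr> of a table fragment."""

    def __init__(self):
        super().__init__()
        self.cells_per_row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts with zero cells.
            self.cells_per_row.append(0)
        elif tag in ("td", "th") and self.cells_per_row:
            # A merged cell (colspan/rowspan) still counts as one element.
            self.cells_per_row[-1] += 1


def html_cells_per_row(html: str) -> list[int]:
    parser = RowCellCounter()
    parser.feed(html)
    return parser.cells_per_row


# Made-up fragment: one full-width merged row between two regular rows.
sample = (
    "<table>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "<tr><td colspan='3'>merged</td></tr>"
    "<tr><td>d</td><td>e</td><td>f</td></tr>"
    "</table>"
)
print(html_cells_per_row(sample))  # [3, 1, 3]
```

Irregular counts such as [3, 1, 3] are what an exact grid-match check trips over: the merged row contributes one HTML cell where the coordinate grid has three columns.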

Root Cause

PP-StructureV3 sometimes merges multiple visual tables into one large table region:

  • The cell_boxes accurately detect individual cell boundaries
  • The HTML uses colspan to represent merged cells, but the resulting HTML grid doesn't match the grid inferred from cell_boxes
  • The current logic requires an exact grid match, which fails for complex merged tables

Proposed Solution

Strategy: cell_boxes-First Rendering

Instead of requiring the HTML grid to match cell_boxes, use cell_boxes directly as the authoritative source for cell boundaries:

  1. Grid Inference from cell_boxes

    • Cluster cell_boxes by Y-coordinate to determine rows
    • Cluster cell_boxes by X-coordinate to determine columns
    • Build a row×col grid map from cell_boxes positions
  2. Content Assignment from HTML

    • Extract text content from HTML in reading order
    • Map text content to cell_boxes positions using coordinate matching
    • Handle cases where HTML has fewer/more cells than cell_boxes
  3. Direct PDF Rendering

    • Render table borders using cell_boxes coordinates (already implemented)
    • Place text content at calculated cell positions
    • Skip ReportLab Table parsing when cell_boxes grid is valid
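
The first two steps above could be sketched as follows. This is an illustration under stated assumptions, not the implementation: each cell box is assumed to be an (x0, y0, x1, y1) tuple with y increasing downward, the 5-unit clustering tolerance is arbitrary, text is paired with boxes in plain reading order as a stand-in for coordinate matching, and all names (cluster_edges, nearest_index, infer_grid, assign_text) are hypothetical. Drawing the result (step 3) is omitted.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def cluster_edges(values: Sequence[float], tolerance: float = 5.0) -> List[float]:
    """Group sorted coordinates whose gap to the previous value is within
    `tolerance`, and return the mean of each group as a grid line."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(lines: Sequence[float], value: float) -> int:
    """Index of the grid line closest to `value`."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - value))


def infer_grid(cell_boxes: Sequence[Box], tolerance: float = 5.0):
    """Step 1: derive column/row grid lines and place each box on the grid."""
    xs = cluster_edges([x for b in cell_boxes for x in (b[0], b[2])], tolerance)
    ys = cluster_edges([y for b in cell_boxes for y in (b[1], b[3])], tolerance)
    placements = []
    for box in cell_boxes:
        col0, col1 = nearest_index(xs, box[0]), nearest_index(xs, box[2])
        row0, row1 = nearest_index(ys, box[1]), nearest_index(ys, box[3])
        placements.append({
            "box": box, "row": row0, "col": col0,
            # A box covering several grid intervals is a merged cell.
            "rowspan": max(1, row1 - row0), "colspan": max(1, col1 - col0),
        })
    return xs, ys, placements


def assign_text(placements: List[dict], texts: Sequence[str]) -> Dict[Tuple[int, int], str]:
    """Step 2 (simplified): pair HTML cell texts with boxes in reading order.
    zip() stops at the shorter input, so surplus texts or boxes stay unassigned."""
    ordered = sorted(placements, key=lambda p: (p["row"], p["col"]))
    return {(p["row"], p["col"]): text for p, text in zip(ordered, texts)}


# Made-up boxes: a header spanning two columns above two regular cells.
boxes = [(0, 0, 100, 20), (0, 20, 50, 40), (51, 21, 100, 40)]
xs, ys, placements = infer_grid(boxes)
content = assign_text(placements, ["Title", "left", "right"])
col_widths = [right - left for left, right in zip(xs, xs[1:])]

print(len(ys) - 1, "rows x", len(xs) - 1, "cols")  # 2 rows x 2 cols
print(placements[0]["colspan"])                     # 2
print(col_widths)                                   # [50.5, 49.5]
```

Column widths fall out of adjacent entries in xs, so they follow the detected cell boundaries rather than an equal split. Merged cells need no HTML colspan information, since a box that covers several grid intervals gets its span from the coordinates alone.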

Key Changes

| Component | Change |
| --- | --- |
| pdf_generator_service.py | Add cell_boxes-first rendering path |
| table_content_rebuilder.py | Enhance to support grid-based content mapping |
| config.py | Add table_rendering_prefer_cellboxes: bool setting |
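
One possible shape for the new setting and the path selection it controls is sketched below. Only the setting name table_rendering_prefer_cellboxes comes from this proposal; the dataclass (a stand-in for whatever settings class config.py actually uses), the default value, choose_render_path, and the returned labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


@dataclass
class TableRenderingSettings:
    # Setting name from the proposal; the default value is an assumption.
    table_rendering_prefer_cellboxes: bool = True


def choose_render_path(settings: TableRenderingSettings,
                       cell_boxes: Optional[Sequence[Box]]) -> str:
    """Pick the rendering path for one table.

    Hypothetical helper: "valid" is reduced here to "non-empty"; a real
    check would also confirm that a grid can be inferred from the boxes.
    """
    if settings.table_rendering_prefer_cellboxes and cell_boxes:
        return "cell_boxes"
    # Existing HTML-based rendering stays available as the fallback.
    return "html_fallback"


print(choose_render_path(TableRenderingSettings(), [(0, 0, 10, 10)]))  # cell_boxes
print(choose_render_path(TableRenderingSettings(), None))              # html_fallback
```

Keeping the decision in one small function would leave the HTML path untouched when the flag is off, which fits the backward-compatibility goal.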

Benefits

  1. Accurate Table Borders: cell_boxes coordinates from ML detection are more precise than a grid reconstructed from HTML parsing
  2. Handles Grid Mismatch: Works even when the HTML colspan/rowspan structure doesn't match the cell_boxes cell count
  3. Consistent Output: Same rendering logic regardless of HTML complexity
  4. Backward Compatible: Existing HTML-based rendering remains as fallback

Non-Goals

  • Not modifying PP-StructureV3 detection logic
  • Not implementing table splitting (separate proposal if needed)
  • Not changing Direct track (PyMuPDF) table extraction

Success Criteria

  1. scan.pdf Table 7 renders with correct column widths based on cell_boxes
  2. All existing table tests continue to pass
  3. No regression for tables where HTML grid matches cell_boxes