# Proposal: Use cell_boxes as Primary Table Rendering Source

## Summary

Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.

## Problem Statement

### Current Issue

When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics:

**Table 7 (Element 7)**:
- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering)
- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan

This **grid mismatch** causes:
1. `_compute_table_grid_from_cell_boxes()` returns `None, None`
2. PDF generator falls back to ReportLab Table with equal column distribution
3. Table renders with incorrect column widths, causing visual misalignment

### Root Cause

PP-StructureV3 sometimes merges multiple visual tables into one large table region:
- The cell_boxes accurately detect individual cell boundaries
- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes
- Current logic requires exact grid match, which fails for complex merged tables

## Proposed Solution

### Strategy: cell_boxes-First Rendering

Instead of requiring HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries:

1. **Grid Inference from cell_boxes**
   - Cluster cell_boxes by Y-coordinate to determine rows
   - Cluster cell_boxes by X-coordinate to determine columns
   - Build a row×col grid map from cell_boxes positions

2. **Content Assignment from HTML**
   - Extract text content from HTML in reading order
   - Map text content to cell_boxes positions using coordinate matching
   - Handle cases where HTML has fewer/more cells than cell_boxes

3. **Direct PDF Rendering**
   - Render table borders using cell_boxes coordinates (already implemented)
   - Place text content at calculated cell positions
   - Skip ReportLab Table parsing when cell_boxes grid is valid

### Key Changes

| Component | Change |
|-----------|--------|
| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |

## Benefits

1. **Accurate Table Borders**: cell_boxes from ML detection are more precise than HTML parsing
2. **Handles Grid Mismatch**: Works even when HTML colspan/rowspan don't match cell count
3. **Consistent Output**: Same rendering logic regardless of HTML complexity
4. **Backward Compatible**: Existing HTML-based rendering remains as fallback

## Non-Goals

- Not modifying PP-StructureV3 detection logic
- Not implementing table splitting (separate proposal if needed)
- Not changing Direct track (PyMuPDF) table extraction

## Success Criteria

1. `scan.pdf` Table 7 renders with correct column widths based on cell_boxes
2. All existing table tests continue to pass
3. No regression for tables where HTML grid matches cell_boxes
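
The "Grid Inference from cell_boxes" step of the proposed solution could look roughly like the sketch below. It is illustrative only, not the existing `_compute_table_grid_from_cell_boxes()` implementation: the function names (`cluster_positions`, `nearest_index`, `infer_grid`), the `(x0, y0, x1, y1)` box layout, and the 5-unit clustering tolerance are all assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left.
Box = Tuple[float, float, float, float]


def cluster_positions(values: Sequence[float], tolerance: float = 5.0) -> List[float]:
    """Greedy 1-D clustering: a sorted value joins the current cluster when it
    lies within `tolerance` of that cluster's last member. Returns the mean of
    each cluster, in ascending order."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(centers: Sequence[float], value: float) -> int:
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def infer_grid(
    cell_boxes: Sequence[Box], tolerance: float = 5.0
) -> Tuple[List[float], List[float], Dict[Tuple[int, int], int]]:
    """Infer row/column start positions from the top-left corners of the boxes
    and map each (row, col) slot to the index of the box that starts there."""
    row_starts = cluster_positions([box[1] for box in cell_boxes], tolerance)
    col_starts = cluster_positions([box[0] for box in cell_boxes], tolerance)

    grid: Dict[Tuple[int, int], int] = {}
    for index, (x0, y0, _x1, _y1) in enumerate(cell_boxes):
        row = nearest_index(row_starts, y0)
        col = nearest_index(col_starts, x0)
        grid[(row, col)] = index  # a later box overwrites an earlier one on collision
    return row_starts, col_starts, grid


if __name__ == "__main__":
    # Two cells on the first row, one wide (merged) cell on the second row.
    boxes = [(0, 0, 50, 20), (50, 1, 100, 20), (1, 20, 100, 40)]
    rows, cols, grid = infer_grid(boxes)
    print(rows, cols, grid)
```

In this sketch a merged cell occupies only the grid slot of its top-left corner, so slots it spans are simply absent from the map. A renderer following the proposal would draw borders and place text from the box coordinates themselves, using the grid only to order cells and to assign HTML text to them.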