Proposal: Use cell_boxes as Primary Table Rendering Source
Summary
Modify table PDF rendering to use cell_boxes coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.
Problem Statement
Current Issue
When processing scan.pdf, PP-StructureV3 detected tables with the following characteristics:
Table 7 (Element 7):
- cell_boxes: 27 cells forming an 11x10 grid (by coordinate clustering)
- HTML structure: 9 rows with irregular columns [7, 7, 1, 3, 3, 3, 3, 3, 1] due to colspan
This grid mismatch causes:
- _compute_table_grid_from_cell_boxes() returns None, None
- PDF generator falls back to ReportLab Table with equal column distribution
- Table renders with incorrect column widths, causing visual misalignment
Root Cause
PP-StructureV3 sometimes merges multiple visual tables into one large table region:
- The cell_boxes accurately detect individual cell boundaries
- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes
- Current logic requires exact grid match, which fails for complex merged tables
Proposed Solution
Strategy: cell_boxes-First Rendering
Instead of requiring HTML grid to match cell_boxes, use cell_boxes directly as the authoritative source for cell boundaries:
- Grid Inference from cell_boxes
  - Cluster cell_boxes by Y-coordinate to determine rows
  - Cluster cell_boxes by X-coordinate to determine columns
  - Build a row×col grid map from cell_boxes positions
- Content Assignment from HTML
  - Extract text content from HTML in reading order
  - Map text content to cell_boxes positions using coordinate matching
  - Handle cases where HTML has fewer/more cells than cell_boxes
- Direct PDF Rendering
  - Render table borders using cell_boxes coordinates (already implemented)
  - Place text content at calculated cell positions
  - Skip ReportLab Table parsing when cell_boxes grid is valid
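
The grid inference and content assignment steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the (x0, y0, x1, y1) box layout, the 5-unit clustering tolerance, and the helper names cluster_edges, build_grid_map and assign_texts are all assumptions made for the example. Content is assigned in plain reading order here, a simpler stand-in for the coordinate matching the proposal describes.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left.
Box = Tuple[float, float, float, float]
GridMap = Dict[Tuple[int, int], int]  # (row, col) -> index into the box list


def cluster_edges(values: Sequence[float], tol: float = 5.0) -> List[float]:
    """Group sorted coordinates whose neighbours lie within `tol`.

    Returns the mean of each group, in ascending order.
    """
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest_index(centers: Sequence[float], value: float) -> int:
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def build_grid_map(boxes: Sequence[Box], tol: float = 5.0) -> Tuple[int, int, GridMap]:
    """Infer rows from top edges and columns from left edges.

    Returns (row_count, col_count, grid). Positions with no box are simply
    absent from the grid, which is how spanned or empty cells show up.
    """
    row_starts = cluster_edges([b[1] for b in boxes], tol)
    col_starts = cluster_edges([b[0] for b in boxes], tol)
    grid: GridMap = {}
    for idx, (x0, y0, _x1, _y1) in enumerate(boxes):
        grid[(nearest_index(row_starts, y0), nearest_index(col_starts, x0))] = idx
    return len(row_starts), len(col_starts), grid


def assign_texts(grid: GridMap, texts: Sequence[str]) -> Dict[Tuple[int, int], str]:
    """Assign texts to occupied cells in row-major (reading) order.

    Surplus texts are dropped; cells without a text get an empty string.
    """
    ordered = sorted(grid)  # tuples sort by row first, then column
    return {pos: (texts[i] if i < len(texts) else "") for i, pos in enumerate(ordered)}


# Usage: four slightly jittered boxes forming a 2x2 table, with only three texts.
boxes = [(0, 0, 50, 20), (50, 1, 100, 20), (1, 20, 50, 40), (51, 21, 100, 40)]
rows, cols, grid = build_grid_map(boxes)
cells = assign_texts(grid, ["a", "b", "c"])
```

Clustering on the top-left corner only is a deliberate simplification: a cell that spans several rows or columns still lands at a single grid position, and its extent can be recovered from its bottom-right corner if the renderer needs it.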
Key Changes
| Component | Change |
|---|---|
| pdf_generator_service.py | Add cell_boxes-first rendering path |
| table_content_rebuilder.py | Enhance to support grid-based content mapping |
| config.py | Add table_rendering_prefer_cellboxes: bool setting |
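
One way the new setting might gate the rendering path is sketched below. Only the setting name table_rendering_prefer_cellboxes comes from this proposal; the TableRenderSettings dataclass, its default value, the choose_render_path helper and the returned path labels are hypothetical, and the real config.py may be structured quite differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) layout


@dataclass
class TableRenderSettings:
    # Setting name taken from the proposal; the default of True is an assumption.
    table_rendering_prefer_cellboxes: bool = True


def choose_render_path(
    settings: TableRenderSettings,
    cell_boxes: Optional[Sequence[Box]],
    grid_is_valid: bool,
) -> str:
    """Pick a rendering path label (hypothetical helper).

    The cell_boxes path is used only when the setting is enabled, boxes are
    present, and a grid could be inferred from them. Everything else keeps
    the existing HTML-based rendering as the fallback.
    """
    if settings.table_rendering_prefer_cellboxes and cell_boxes and grid_is_valid:
        return "cell_boxes"
    return "html_fallback"


# Usage: the same table takes a different path depending on the setting.
one_box = [(0.0, 0.0, 10.0, 10.0)]
preferred = choose_render_path(TableRenderSettings(), one_box, grid_is_valid=True)
disabled = choose_render_path(
    TableRenderSettings(table_rendering_prefer_cellboxes=False), one_box, grid_is_valid=True
)
```

Keeping the decision in one small function would make the fallback behaviour easy to cover in the existing table tests.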
Benefits
- Accurate Table Borders: cell_boxes from ML detection are more precise than HTML parsing
- Handles Grid Mismatch: Works even when HTML colspan/rowspan don't match cell count
- Consistent Output: Same rendering logic regardless of HTML complexity
- Backward Compatible: Existing HTML-based rendering remains as fallback
Non-Goals
- Not modifying PP-StructureV3 detection logic
- Not implementing table splitting (separate proposal if needed)
- Not changing Direct track (PyMuPDF) table extraction
Success Criteria
- scan.pdf Table 7 renders with correct column widths based on cell_boxes
- All existing table tests continue to pass
- No regression for tables where HTML grid matches cell_boxes