chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal. This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/use-cellboxes-for-table-rendering/proposal.md
+++ b/openspec/changes/use-cellboxes-for-table-rendering/proposal.md
@@ -0,0 +1,75 @@
+# Proposal: Use cell_boxes as Primary Table Rendering Source
+
+## Summary
+
+Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.
+
+## Problem Statement
+
+### Current Issue
+
+When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics:
+
+**Table 7 (Element 7)**:
+- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering)
+- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan
+
+This **grid mismatch** causes:
+1. `_compute_table_grid_from_cell_boxes()` returns `None, None`
+2. PDF generator falls back to ReportLab Table with equal column distribution
+3. Table renders with incorrect column widths, causing visual misalignment
+
+### Root Cause
+
+PP-StructureV3 sometimes merges multiple visual tables into one large table region:
+- The cell_boxes accurately detect individual cell boundaries
+- The HTML uses colspan to represent merged cells, but the row/column grid it implies doesn't match the grid derived from cell_boxes
+- Current logic requires an exact grid match, which fails for complex merged tables
+
+## Proposed Solution
+
+### Strategy: cell_boxes-First Rendering
+
+Instead of requiring the HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries:
+
+1. **Grid Inference from cell_boxes**
+   - Cluster cell_boxes by Y-coordinate to determine rows
+   - Cluster cell_boxes by X-coordinate to determine columns
+   - Build a row×col grid map from cell_boxes positions
+
+2. **Content Assignment from HTML**
+   - Extract text content from HTML in reading order
+   - Map text content to cell_boxes positions using coordinate matching
+   - Handle cases where the HTML has fewer or more cells than cell_boxes
+
+3. **Direct PDF Rendering**
+   - Render table borders using cell_boxes coordinates (already implemented)
+   - Place text content at calculated cell positions
+   - Skip ReportLab Table parsing when the cell_boxes grid is valid
+
+### Key Changes
+
+| Component | Change |
+|-----------|--------|
+| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
+| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
+| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |
+
+## Benefits
+
+1. **Accurate Table Borders**: cell_boxes from ML detection are more precise than HTML parsing
+2. **Handles Grid Mismatch**: Works even when the grid implied by HTML colspan/rowspan doesn't match the cell_boxes grid
+3. **Consistent Output**: Same rendering logic regardless of HTML complexity
+4. **Backward Compatible**: Existing HTML-based rendering remains as fallback
+
+## Non-Goals
+
+- Not modifying PP-StructureV3 detection logic
+- Not implementing table splitting (separate proposal if needed)
+- Not changing Direct track (PyMuPDF) table extraction
+
+## Success Criteria
+
+1. `scan.pdf` Table 7 renders with correct column widths based on cell_boxes
+2. All existing table tests continue to pass
+3. No regression for tables where HTML grid matches cell_boxes
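
Note on the "Grid Inference from cell_boxes" step in the proposal above: a minimal sketch of how coordinate clustering could produce a row×col grid with spans. It is not the code in `pdf_generator_service.py`. The function names (`cluster_coords`, `infer_grid`), the `(x0, y0, x1, y1)` box layout, and the 5-unit tolerance are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]
# (row, col, row_span, col_span) for one cell box.
Placement = Tuple[int, int, int, int]


def cluster_coords(values: Sequence[float], tol: float = 5.0) -> List[float]:
    """Group sorted coordinates whose gap to the previous value is <= tol.

    Returns the mean of each group, i.e. one representative grid line
    per cluster of nearly equal edge coordinates.
    """
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(lines: Sequence[float], value: float) -> int:
    """Index of the grid line closest to value."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - value))


def infer_grid(
    cell_boxes: Sequence[Box], tol: float = 5.0
) -> Tuple[List[float], List[float], List[Placement]]:
    """Infer column lines, row lines, and per-box placement with spans."""
    # Cluster all left/right edges into column lines, top/bottom into row lines.
    xs = cluster_coords([x for b in cell_boxes for x in (b[0], b[2])], tol)
    ys = cluster_coords([y for b in cell_boxes for y in (b[1], b[3])], tol)

    placements: List[Placement] = []
    for x0, y0, x1, y1 in cell_boxes:
        c0, c1 = nearest_index(xs, x0), nearest_index(xs, x1)
        r0, r1 = nearest_index(ys, y0), nearest_index(ys, y1)
        # A box covering several grid intervals is treated as a merged cell.
        placements.append((r0, c0, max(r1 - r0, 1), max(c1 - c0, 1)))
    return xs, ys, placements


if __name__ == "__main__":
    boxes = [
        (0.0, 0.0, 100.0, 20.0),    # header spanning both columns
        (0.0, 21.0, 49.0, 40.0),    # slightly jittered left cell
        (51.0, 20.0, 100.0, 41.0),  # slightly jittered right cell
    ]
    xs, ys, placements = infer_grid(boxes)
    print(len(ys) - 1, "rows x", len(xs) - 1, "cols")
    print(placements)
```

Because spans are read from the boxes themselves, this approach never consults the HTML colspan/rowspan attributes for structure. The HTML would only be needed to supply text content. The tolerance would need tuning against real PP-StructureV3 output, since single-linkage clustering can chain nearby edges into one line when cells are very narrow.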