# Proposal: Use cell_boxes as Primary Table Rendering Source

## Summary

Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.

## Problem Statement

### Current Issue

When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics:

**Table 7 (Element 7)**:
- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering)
- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan

This **grid mismatch** causes:
1. `_compute_table_grid_from_cell_boxes()` returns `None, None`
2. PDF generator falls back to ReportLab Table with equal column distribution
3. Table renders with incorrect column widths, causing visual misalignment
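
The HTML side of the mismatch can be reproduced by counting cells per row. The following is a minimal sketch using only the standard library; `RowCellCounter`, `html_row_shape`, and `html_is_regular` are hypothetical helper names rather than functions from the existing code, and the sample HTML is invented for illustration. It is not the check that `_compute_table_grid_from_cell_boxes()` performs, only a way to see why a row shape such as `[7, 7, 1, 3, 3, 3, 3, 3, 1]` cannot line up with a regular grid.

```python
from html.parser import HTMLParser


class RowCellCounter(HTMLParser):
    """Count <td>/<th> elements in each <tr>, ignoring colspan/rowspan."""

    def __init__(self):
        super().__init__()
        self.cells_per_row = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts with zero cells.
            self.cells_per_row.append(0)
        elif tag in ("td", "th") and self.cells_per_row:
            self.cells_per_row[-1] += 1


def html_row_shape(table_html):
    """Return the number of cells found in each HTML row."""
    counter = RowCellCounter()
    counter.feed(table_html)
    return counter.cells_per_row


def html_is_regular(table_html, n_cols):
    """True only if every HTML row has exactly n_cols cells."""
    shape = html_row_shape(table_html)
    return bool(shape) and all(count == n_cols for count in shape)


# Invented sample: the second row uses colspan, so it has a single <td>.
sample = (
    "<table>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "<tr><td colspan='3'>merged</td></tr>"
    "</table>"
)
print(html_row_shape(sample))      # [3, 1]
print(html_is_regular(sample, 3))  # False
```

Any strategy that requires each HTML row to have the same cell count as the coordinate grid will reject a table like this, even though the cell boundaries themselves are well defined.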

### Root Cause

PP-StructureV3 sometimes merges multiple visual tables into one large table region:
- The cell_boxes accurately detect individual cell boundaries
- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes
- Current logic requires exact grid match, which fails for complex merged tables

## Proposed Solution

### Strategy: cell_boxes-First Rendering

Instead of requiring HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries:

1. **Grid Inference from cell_boxes**
   - Cluster cell_boxes by Y-coordinate to determine rows
   - Cluster cell_boxes by X-coordinate to determine columns
   - Build a row×col grid map from cell_boxes positions

2. **Content Assignment from HTML**
   - Extract text content from HTML in reading order
   - Map text content to cell_boxes positions using coordinate matching
   - Handle cases where HTML has fewer/more cells than cell_boxes

3. **Direct PDF Rendering**
   - Render table borders using cell_boxes coordinates (already implemented)
   - Place text content at calculated cell positions
   - Skip ReportLab Table parsing when cell_boxes grid is valid
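
The first two steps could look roughly like the sketch below. It rests on several assumptions that the proposal does not state: each box is an `(x0, y0, x1, y1)` tuple with a top-left origin, a fixed pixel `tolerance` is enough to separate rows and columns, and HTML texts are assigned to cells in row-major order as a simple stand-in for coordinate matching. All function names (`cluster_positions`, `nearest_index`, `build_grid_map`, `assign_text`) are hypothetical.

```python
def cluster_positions(values, tolerance=5.0):
    """Group sorted 1-D coordinates into clusters and return one center each.

    A new cluster starts whenever the gap to the previous value exceeds
    `tolerance` (simple chain clustering).
    """
    centers = []
    group = []
    for value in sorted(values):
        if group and value - group[-1] > tolerance:
            centers.append(sum(group) / len(group))
            group = []
        group.append(value)
    if group:
        centers.append(sum(group) / len(group))
    return centers


def nearest_index(centers, value):
    """Index of the cluster center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def build_grid_map(cell_boxes, tolerance=5.0):
    """Return (n_rows, n_cols, grid), where grid maps (row, col) -> box index.

    Rows come from clustering the top edges (y0), columns from clustering
    the left edges (x0). Positions covered only by a merged cell are absent.
    """
    row_centers = cluster_positions([box[1] for box in cell_boxes], tolerance)
    col_centers = cluster_positions([box[0] for box in cell_boxes], tolerance)
    grid = {}
    for index, (x0, y0, _x1, _y1) in enumerate(cell_boxes):
        key = (nearest_index(row_centers, y0), nearest_index(col_centers, x0))
        grid[key] = index
    return len(row_centers), len(col_centers), grid


def assign_text(grid, texts):
    """Assign texts (HTML reading order) to grid positions in row-major order.

    Missing texts become empty strings; surplus texts are dropped.
    """
    ordered = sorted(grid)
    return {
        pos: (texts[i] if i < len(texts) else "")
        for i, pos in enumerate(ordered)
    }


# Invented boxes: two cells in the first row, one merged cell in the second.
boxes = [(10, 10, 60, 30), (61, 11, 110, 30), (10, 31, 110, 50)]
n_rows, n_cols, grid = build_grid_map(boxes)
print(n_rows, n_cols)  # 2 2
print(grid)            # {(0, 0): 0, (0, 1): 1, (1, 0): 2}
print(assign_text(grid, ["A", "B"]))
# {(0, 0): 'A', (0, 1): 'B', (1, 0): ''}
```

Because the grid is derived from the boxes alone, a merged cell simply leaves its covered positions out of the map, so no agreement with the HTML row shape is required.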

### Key Changes

| Component | Change |
|-----------|--------|
| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |
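
One way the new setting might gate the rendering path is sketched below. Only the name `table_rendering_prefer_cellboxes` comes from this proposal; the `TableRenderingSettings` dataclass, the `choose_table_renderer` function, and the returned path labels are hypothetical, and the real `config.py` may use a different settings mechanism.

```python
from dataclasses import dataclass


@dataclass
class TableRenderingSettings:
    # Setting name from the proposal; this wrapper class is an assumption.
    table_rendering_prefer_cellboxes: bool


def choose_table_renderer(settings, grid):
    """Pick a rendering path for one table.

    `grid` stands for whatever the cell_boxes grid inference produced;
    None or an empty result is treated as "no valid grid".
    """
    grid_is_valid = grid is not None and len(grid) > 0
    if settings.table_rendering_prefer_cellboxes and grid_is_valid:
        return "cell_boxes"
    # Existing HTML-based rendering stays available as the fallback.
    return "html_fallback"


prefer = TableRenderingSettings(table_rendering_prefer_cellboxes=True)
print(choose_table_renderer(prefer, {(0, 0): 0}))  # cell_boxes
print(choose_table_renderer(prefer, None))         # html_fallback
```

Keeping the decision in a single function means the HTML path is untouched when the flag is off or when grid inference fails.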

## Benefits

1. **Accurate Table Borders**: cell_boxes from ML detection are more precise than HTML parsing
2. **Handles Grid Mismatch**: Works even when HTML colspan/rowspan don't match cell count
3. **Consistent Output**: Same rendering logic regardless of HTML complexity
4. **Backward Compatible**: Existing HTML-based rendering remains as fallback

## Non-Goals

- Not modifying PP-StructureV3 detection logic
- Not implementing table splitting (separate proposal if needed)
- Not changing Direct track (PyMuPDF) table extraction

## Success Criteria

1. `scan.pdf` Table 7 renders with correct column widths based on cell_boxes
2. All existing table tests continue to pass
3. No regression for tables where HTML grid matches cell_boxes