egg 940a406dce chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 11:55:39 +08:00

Proposal: Use cell_boxes as Primary Table Rendering Source

Summary

Modify table PDF rendering to use cell_boxes coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.

Problem Statement

Current Issue

When processing scan.pdf, PP-StructureV3 detected tables with the following characteristics:

Table 7 (Element 7):

cell_boxes: 27 cells forming an 11x10 grid (by coordinate clustering)
HTML structure: 9 rows with irregular columns [7, 7, 1, 3, 3, 3, 3, 3, 1] due to colspan

This grid mismatch causes:

_compute_table_grid_from_cell_boxes() returns None, None
PDF generator falls back to ReportLab Table with equal column distribution
Table renders with incorrect column widths, causing visual misalignment
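The mismatch can be made concrete with the numbers reported above. The following is a minimal sketch, not the project's actual check: `html_grid_shape` and `grids_match` are hypothetical helpers, and the real `_compute_table_grid_from_cell_boxes()` may compare the grids differently.

```python
from typing import List, Tuple

# Values taken from the Table 7 report above.
HTML_COLS_PER_ROW: List[int] = [7, 7, 1, 3, 3, 3, 3, 3, 1]
CELL_BOXES_GRID: Tuple[int, int] = (11, 10)  # from coordinate clustering


def html_grid_shape(cols_per_row: List[int]) -> Tuple[int, int]:
    """Return (row count, widest row) for per-row HTML cell counts."""
    return len(cols_per_row), max(cols_per_row)


def grids_match(cols_per_row: List[int], boxes_grid: Tuple[int, int]) -> bool:
    """Hypothetical exact-match test: both dimensions must agree."""
    return html_grid_shape(cols_per_row) == boxes_grid


if __name__ == "__main__":
    print(html_grid_shape(HTML_COLS_PER_ROW))
    print(grids_match(HTML_COLS_PER_ROW, CELL_BOXES_GRID))
```

Under this assumed check, the HTML side has 9 rows with at most 7 cells per row, so neither dimension agrees with the 11x10 grid and an exact-match requirement cannot be satisfied.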

Root Cause

PP-StructureV3 sometimes merges multiple visual tables into one large table region:

The cell_boxes accurately detect individual cell boundaries
The HTML uses colspan to represent merged cells, but the resulting HTML grid doesn't line up with the cell_boxes grid
The current logic requires an exact grid match, which fails for complex merged tables

Proposed Solution

Strategy: cell_boxes-First Rendering

Instead of requiring the HTML grid to match cell_boxes, use cell_boxes directly as the authoritative source for cell boundaries:

1. Grid Inference from cell_boxes
   - Cluster cell_boxes by Y-coordinate to determine rows
   - Cluster cell_boxes by X-coordinate to determine columns
   - Build a row×col grid map from cell_boxes positions
2. Content Assignment from HTML
   - Extract text content from HTML in reading order
   - Map text content to cell_boxes positions using coordinate matching
   - Handle cases where HTML has fewer/more cells than cell_boxes
3. Direct PDF Rendering
   - Render table borders using cell_boxes coordinates (already implemented)
   - Place text content at calculated cell positions
   - Skip ReportLab Table parsing when cell_boxes grid is valid
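The first two steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the proposed implementation: boxes are assumed to be `(x0, y0, x1, y1)` tuples, the tolerance of 5 units is arbitrary, and all function names are hypothetical. Content assignment is simplified to reading-order mapping rather than full coordinate matching.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)
GridPos = Tuple[int, int]                # (row index, column index)


def cluster_1d(values: Sequence[float], tol: float) -> List[float]:
    """Group sorted values whose gap to the previous value is <= tol.

    Returns the mean of each group, in ascending order.
    """
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) / len(c) for c in clusters]


def nearest_index(centers: Sequence[float], v: float) -> int:
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def infer_grid(boxes: Sequence[Box], tol: float = 5.0) -> Tuple[int, int, Dict[GridPos, int]]:
    """Cluster top edges into rows and left edges into columns.

    Returns (row count, column count, {(row, col): box index}).
    """
    row_centers = cluster_1d([b[1] for b in boxes], tol)
    col_centers = cluster_1d([b[0] for b in boxes], tol)
    grid: Dict[GridPos, int] = {}
    for idx, (x0, y0, _x1, _y1) in enumerate(boxes):
        grid[(nearest_index(row_centers, y0), nearest_index(col_centers, x0))] = idx
    return len(row_centers), len(col_centers), grid


def assign_texts(grid: Dict[GridPos, int], texts: Sequence[str]) -> Dict[GridPos, str]:
    """Assign HTML texts (reading order) to grid positions sorted by row, then column.

    Surplus texts are dropped; positions without a text get an empty string.
    """
    ordered = sorted(grid)  # tuples sort by row first, then column
    return {pos: (texts[i] if i < len(texts) else "") for i, pos in enumerate(ordered)}
```

With four boxes laid out 2×2 and only three HTML texts, `infer_grid` yields a 2×2 grid and `assign_texts` leaves the last cell empty, which is one possible way to handle the "fewer cells than cell_boxes" case.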

Key Changes

| Component | Change |
|-----------|--------|
| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |
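How the new setting might gate the rendering path can be sketched as below. Only the setting name `table_rendering_prefer_cellboxes` comes from this proposal; the `TableRenderSettings` class, its default value, the `choose_render_path` function, and the path labels are assumptions for illustration, and the real `config.py` may be structured differently.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TableRenderSettings:
    # Setting name from the proposal; the default of True is an assumption.
    table_rendering_prefer_cellboxes: bool = True


def choose_render_path(
    settings: TableRenderSettings,
    grid_shape: Optional[Tuple[int, int]],
) -> str:
    """Pick the cell_boxes path only when enabled and a non-empty grid was inferred.

    Otherwise fall back to the existing HTML-based rendering.
    """
    grid_is_valid = grid_shape is not None and all(n > 0 for n in grid_shape)
    if settings.table_rendering_prefer_cellboxes and grid_is_valid:
        return "cell_boxes"
    return "html_fallback"
```

Keeping the decision in one small function means the HTML-based path stays untouched and remains reachable whenever the flag is off or grid inference fails.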

Benefits

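Accurate Table Borders: cell_boxes come from ML detection of the visual cell boundaries, so they follow the page layout more closely than a grid reconstructed from HTML parsing
Handles Grid Mismatch: Works even when the HTML colspan/rowspan structure doesn't match the cell_boxes grid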
Consistent Output: Same rendering logic regardless of HTML complexity
Backward Compatible: Existing HTML-based rendering remains as fallback

Non-Goals

Not modifying PP-StructureV3 detection logic
Not implementing table splitting (separate proposal if needed)
Not changing Direct track (PyMuPDF) table extraction

Success Criteria

scan.pdf Table 7 renders with correct column widths based on cell_boxes
All existing table tests continue to pass
No regression for tables where HTML grid matches cell_boxes
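One way to check the first criterion is to derive column widths directly from the vertical edges of cell_boxes and compare them with what an equal distribution would give. The sketch below is illustrative only: the `(x0, y0, x1, y1)` box layout, the tolerance, and the `column_widths` name are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def column_widths(boxes: Sequence[Box], tol: float = 5.0) -> List[float]:
    """Derive column widths from the distinct vertical edges of the boxes.

    Edges closer than tol to the previously kept edge are treated as duplicates.
    """
    edges = sorted({x for b in boxes for x in (b[0], b[2])})
    merged: List[float] = []
    for x in edges:
        if merged and x - merged[-1] <= tol:
            continue  # near-duplicate of the last kept edge
        merged.append(x)
    return [right - left for left, right in zip(merged, merged[1:])]
```

For boxes whose vertical edges sit at roughly 0, 50, and 120, this returns widths of 50 and 70, whereas an equal split of the same 120-unit span would assign 60 to each column.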
