Design: cell_boxes-First Table Rendering
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│                    Table Rendering Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input: table_element                                           │
│    ├── cell_boxes: [[x0,y0,x1,y1], ...]    (from PP-StructureV3)│
│    ├── html: "<table>...</table>"          (from PP-StructureV3)│
│    └── bbox: [x0, y0, x1, y1]              (table boundary)     │
│                                                                 │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Step 1: Grid Inference from cell_boxes                      │ │
│ │                                                             │ │
│ │ cell_boxes → cluster by Y → rows                            │ │
│ │            → cluster by X → cols                            │ │
│ │            → build grid[row][col] = cell_bbox               │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Step 2: Content Extraction from HTML                        │ │
│ │                                                             │ │
│ │ html → parse → extract text list in reading order           │ │
│ │      → flatten colspan/rowspan → [text1, text2, ...]        │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Step 3: Content-to-Cell Mapping                             │ │
│ │                                                             │ │
│ │ Option A: Sequential assignment (text[i] → cell[i])         │ │
│ │ Option B: Coordinate matching   (text_bbox ∩ cell_bbox)     │ │
│ │ Option C: Row-by-row assignment                             │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ Step 4: PDF Rendering                                       │ │
│ │                                                             │ │
│ │ For each cell in grid:                                      │ │
│ │   1. Draw cell border at cell_bbox coordinates              │ │
│ │   2. Render text content inside cell                        │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  Output: Table rendered in PDF with accurate cell boundaries    │
└─────────────────────────────────────────────────────────────────┘
Detailed Design
1. Grid Inference Algorithm
from typing import Dict, List, Tuple

def infer_grid_from_cellboxes(cell_boxes: List[List[float]], threshold: float = 15.0):
    """
    Infer row/column grid structure from cell_boxes coordinates.
    Args:
        cell_boxes: List of [x0, y0, x1, y1] coordinates
        threshold: Clustering threshold for row/column grouping
    Returns:
        grid: Dict[Tuple[int,int], Dict] mapping (row, col) to cell info
        row_heights: List of row heights, one per row
        col_widths: List of column widths, one per column
    """
    if not cell_boxes:
        return {}, [], []
    # 1. Extract all Y-centers and X-centers
    y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes]
    x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes]
    # 2. Cluster Y-centers into rows
    rows = cluster_values(y_centers, threshold)  # Sorted list of row center values
    # 3. Cluster X-centers into columns
    cols = cluster_values(x_centers, threshold)  # Sorted list of column center values
    # 4. Assign each cell_box to (row, col); the first box at a position wins
    grid: Dict[Tuple[int, int], Dict] = {}
    for i, cb in enumerate(cell_boxes):
        row = find_cluster(y_centers[i], rows)
        col = find_cluster(x_centers[i], cols)
        grid.setdefault((row, col), {'bbox': cb, 'index': i})
    # 5. Calculate heights/widths from the box boundaries, one entry per row/column.
    #    Center-to-center differences would yield one entry too few.
    row_heights = [max(c['bbox'][3] - c['bbox'][1] for (r, _), c in grid.items() if r == i)
                   for i in range(len(rows))]
    col_widths = [max(c['bbox'][2] - c['bbox'][0] for (_, k), c in grid.items() if k == j)
                  for j in range(len(cols))]
    return grid, row_heights, col_widths
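The grid inference above calls cluster_values and find_cluster, which this document does not define. The following is a minimal sketch of one way they could behave, assuming that cluster_values returns the sorted mean of each cluster and that find_cluster returns the index of the nearest cluster center. Both behaviors are assumptions for illustration, not the project's confirmed implementation.

```python
from typing import List


def cluster_values(values: List[float], threshold: float) -> List[float]:
    """Group values whose gap to the previous sorted value is <= threshold.

    Returns the mean of each cluster, in ascending order (assumed contract).
    """
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= threshold:
            clusters[-1].append(v)   # close enough: same row/column
        else:
            clusters.append([v])     # gap too large: start a new cluster
    return [sum(c) / len(c) for c in clusters]


def find_cluster(value: float, centers: List[float]) -> int:
    """Return the index of the cluster center nearest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


rows = cluster_values([10.0, 12.0, 50.0, 52.0, 91.0], threshold=15.0)
print(rows)                      # [11.0, 51.0, 91.0]
print(find_cluster(49.0, rows))  # 1
```

Gap-based clustering can chain: a run of values each within the threshold of its neighbor ends up in one cluster even if the first and last are far apart. For tightly packed rows a smaller threshold may be needed.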
2. Content Extraction
The HTML content extraction should handle colspan/rowspan by flattening:
from typing import List

def extract_cell_contents(html: str) -> List[str]:
    """
    Extract cell text contents from HTML in reading order.
    Expands colspan into trailing empty strings. Rowspan is not expanded
    here; that would need per-column tracking across rows.
    Returns:
        List of text strings, one per logical cell position within each row
    """
    parser = HTMLTableParser()
    parser.feed(html)
    if not parser.tables:
        return []  # No <table> was parsed
    contents = []
    for row in parser.tables[0]['rows']:
        for cell in row['cells']:
            contents.append(cell['text'])
            # For colspan > 1, add empty strings for merged cells
            for _ in range(cell.get('colspan', 1) - 1):
                contents.append('')
    return contents
3. Content-to-Cell Mapping Strategy
Recommended: Row-by-row Sequential Assignment
Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order:
def map_content_to_grid(grid, contents, num_rows, num_cols):
    """
    Map extracted content to grid cells row by row.
    """
    content_idx = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) in grid:
                if content_idx < len(contents):
                    grid[(row, col)]['content'] = contents[content_idx]
                    content_idx += 1
                else:
                    grid[(row, col)]['content'] = ''
    return grid
4. PDF Rendering Integration
Modify pdf_generator_service.py to use the cell_boxes-first path:
def draw_table_region(self, ...):
    cell_boxes = table_element.get('cell_boxes', [])
    html_content = table_element.get('content', '')
    if cell_boxes and settings.table_rendering_prefer_cellboxes:
        # Try cell_boxes-first approach
        grid, row_heights, col_widths = infer_grid_from_cellboxes(cell_boxes)
        if grid:
            # Extract content from HTML
            contents = extract_cell_contents(html_content)
            # Map content to grid; derive the grid size from the occupied positions
            num_rows = max(r for r, _ in grid) + 1
            num_cols = max(c for _, c in grid) + 1
            grid = map_content_to_grid(grid, contents, num_rows, num_cols)
            # Render using cell_boxes coordinates
            success = self._render_table_from_grid(
                pdf_canvas, grid, row_heights, col_widths,
                page_height, scale_w, scale_h
            )
            if success:
                return  # Done
    # Fallback to existing HTML-based rendering
    self._render_table_from_html(...)
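The renderer receives page_height, scale_w and scale_h, which suggests each cell bbox is converted from OCR coordinates to PDF coordinates before drawing. A minimal sketch of that conversion follows, assuming cell_boxes use a top-left origin in image pixels, the PDF uses a bottom-left origin, page_height is in PDF units, and scale_w/scale_h are PDF units per pixel. The function name cell_bbox_to_pdf_rect is hypothetical.

```python
from typing import Sequence, Tuple


def cell_bbox_to_pdf_rect(
    bbox: Sequence[float],
    page_height: float,
    scale_w: float,
    scale_h: float,
) -> Tuple[float, float, float, float]:
    """Convert [x0, y0, x1, y1] (top-left origin) to (x, y, width, height)
    with a bottom-left origin, as PDF drawing calls typically expect."""
    x0, y0, x1, y1 = bbox
    x = x0 * scale_w
    width = (x1 - x0) * scale_w
    height = (y1 - y0) * scale_h
    # Flip the Y axis: the box's bottom edge (y1, measured from the top)
    # becomes the rectangle's lower edge measured from the page bottom.
    y = page_height - y1 * scale_h
    return x, y, width, height


print(cell_bbox_to_pdf_rect([100, 50, 300, 90], page_height=842, scale_w=0.5, scale_h=0.5))
# (50.0, 797.0, 100.0, 20.0)
```

If pdf_canvas is a ReportLab canvas (an assumption; the document does not name the library), the returned tuple matches the argument order of its rect(x, y, width, height) method.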
Configuration
# config.py
class Settings:
    # Table rendering strategy
    table_rendering_prefer_cellboxes: bool = Field(
        default=True,
        description="Use cell_boxes coordinates as primary table structure source"
    )
    table_cellboxes_row_threshold: float = Field(
        default=15.0,
        description="Y-coordinate threshold for row clustering"
    )
    table_cellboxes_col_threshold: float = Field(
        default=15.0,
        description="X-coordinate threshold for column clustering"
    )
Edge Cases
1. Empty cell_boxes
- Condition: cell_boxes is empty or None
- Action: Fall back to HTML-based rendering
2. Content Count Mismatch
- Condition: HTML has more/fewer cells than cell_boxes grid
- Action: Fill available cells, leave extras empty, log warning
3. Overlapping cell_boxes
- Condition: Multiple cell_boxes map to same grid position
- Action: Use first one, log warning
4. Single-cell Tables
- Condition: Only 1 cell_box detected
- Action: Render as single-cell table (valid case)
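Edge cases 2 and 3 both call for a logged warning rather than a failure. The sketch below shows one possible shape for those checks. The helper names place_cell and check_content_count are hypothetical and do not come from the existing code base.

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)

Grid = Dict[Tuple[int, int], dict]


def place_cell(grid: Grid, pos: Tuple[int, int], bbox: List[float], index: int) -> bool:
    """Edge case 3: keep the first cell_box at a grid position, warn on the rest."""
    if pos in grid:
        logger.warning("cell_box %d overlaps grid position %s; keeping cell_box %d",
                       index, pos, grid[pos]['index'])
        return False
    grid[pos] = {'bbox': bbox, 'index': index}
    return True


def check_content_count(grid: Grid, contents: List[str]) -> int:
    """Edge case 2: return len(contents) - len(grid), warning when nonzero."""
    diff = len(contents) - len(grid)
    if diff > 0:
        logger.warning("HTML has %d more cells than the cell_boxes grid", diff)
    elif diff < 0:
        logger.warning("HTML has %d fewer cells than the cell_boxes grid", -diff)
    return diff


grid: Grid = {}
place_cell(grid, (0, 0), [0, 0, 10, 10], 0)
place_cell(grid, (0, 0), [1, 1, 11, 11], 1)   # logs a warning, grid unchanged
check_content_count(grid, ['a', 'b', 'c'])    # logs a warning, returns 2
```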
Testing Plan
- Unit Tests
  - test_infer_grid_from_cellboxes: Various cell_box configurations
  - test_content_mapping: Content assignment scenarios
- Integration Tests
  - test_scan_pdf_table_7: Verify the problematic table renders correctly
  - test_existing_tables: No regression on previously working tables
- Visual Verification
  - Compare PDF output before/after for scan.pdf
  - Check table alignment and text placement
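For the unit tests, the "various cell_box configurations" could come from a small synthetic generator rather than from real OCR output. This is a sketch only: make_cell_boxes and its parameters are hypothetical, and the jitter is deterministic so that test runs are reproducible.

```python
from typing import List


def make_cell_boxes(
    n_rows: int,
    n_cols: int,
    cell_w: float = 100.0,
    cell_h: float = 30.0,
    jitter: float = 0.0,
) -> List[List[float]]:
    """Build a regular grid of [x0, y0, x1, y1] boxes in reading order.

    Alternate boxes are shifted by +jitter / -jitter to mimic detection noise.
    """
    boxes: List[List[float]] = []
    for r in range(n_rows):
        for c in range(n_cols):
            d = jitter if (r + c) % 2 == 0 else -jitter
            x0, y0 = c * cell_w + d, r * cell_h + d
            boxes.append([x0, y0, x0 + cell_w, y0 + cell_h])
    return boxes


boxes = make_cell_boxes(2, 3)
print(len(boxes))   # 6
print(boxes[5])     # [200.0, 30.0, 300.0, 60.0]
```

A test such as test_infer_grid_from_cellboxes could then feed make_cell_boxes(3, 4, jitter=2.0) to the grid inference and check that it reports 3 rows and 4 columns with the default threshold.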