# Design: Table Column Alignment Correction

## Context

PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for:
- Tables with unclear left borders
- Cells containing vertical Chinese text
- Complex merged cells

This design introduces a **post-processing correction layer** that validates and fixes column assignments using geometric coordinates.

## Goals / Non-Goals

**Goals:**
- Correct column shift errors without modifying the PP-Structure model
- Use the header row as the authoritative column reference
- Merge fragmented vertical text into proper cells
- Maintain backward compatibility with the existing pipeline

**Non-Goals:**
- Training new OCR/structure models
- Modifying PP-Structure's internal behavior
- Handling tables without clear headers (future enhancement)

## Architecture

```
PP-Structure Output
        │
        ▼
┌───────────────────┐
│ Table Column      │
│ Corrector         │
│ (new middleware)  │
├───────────────────┤
│ 1. Extract header │
│    column ranges  │
│ 2. Validate cells │
│ 3. Correct col    │
│    assignments    │
└───────────────────┘
        │
        ▼
   PDF Generator
```

## Decisions

### Decision 1: Header-Anchor Algorithm

**Approach:** Use the cells of the first row (`row_idx=0`) as column anchors.

**Algorithm:**
```python
from __future__ import annotations  # annotations are not evaluated at definition time

import logging
from typing import List

logger = logging.getLogger(__name__)

# Cell, ColumnAnchor and calculate_x_overlap are assumed to be defined elsewhere.


def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
    """
    Extract X-coordinate ranges from header row to define column boundaries.

    Returns:
        List of ColumnAnchor(col_idx, x_min, x_max)
    """
    anchors = []
    for cell in header_cells:
        anchors.append(ColumnAnchor(
            col_idx=cell.col_idx,
            x_min=cell.bbox.x0,
            x_max=cell.bbox.x1
        ))
    return sorted(anchors, key=lambda a: a.x_min)


def correct_column(cell: Cell, anchors: List[ColumnAnchor],
                   threshold: float = 0.5) -> int:
    """
    Find the correct column index based on X-coordinate overlap.

    Strategy:
    1. Calculate overlap with each column anchor
    2. If overlap > threshold (default 50%) with a different column, correct it
    3. If no overlap, find nearest column by center point
    """
    # Find best matching anchor
    best_anchor = None
    best_overlap = 0.0

    for anchor in anchors:
        overlap = calculate_x_overlap(cell.bbox, anchor)
        if overlap > best_overlap:
            best_overlap = overlap
            best_anchor = anchor

    # No overlap with any anchor: fall back to the nearest column by center point
    if best_anchor is None:
        if not anchors:
            return cell.col_idx
        cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2
        nearest = min(
            anchors,
            key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x),
        )
        if nearest.col_idx != cell.col_idx:
            logger.info(f"Correcting cell col {cell.col_idx} -> {nearest.col_idx}")
        return nearest.col_idx

    # If significant overlap with different column, correct
    if best_overlap > threshold and best_anchor.col_idx != cell.col_idx:
        logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
        return best_anchor.col_idx

    return cell.col_idx
```
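`correct_column` calls `calculate_x_overlap`, which this design does not define. A minimal sketch follows, assuming the ratio is the overlapping X length divided by the cell's own width. The `BBox` and `ColumnAnchor` dataclasses here are illustrative stand-ins, not the project's actual types.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Illustrative stand-in for the project's bounding box type."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class ColumnAnchor:
    """Illustrative stand-in matching ColumnAnchor(col_idx, x_min, x_max)."""
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(bbox: BBox, anchor: ColumnAnchor) -> float:
    """Return the fraction of the cell's width covered by the anchor's X range.

    Assumption: the ratio is normalized by the cell width, not the anchor width.
    """
    cell_width = bbox.x1 - bbox.x0
    if cell_width <= 0:
        return 0.0  # degenerate box: treat as no overlap
    overlap = min(bbox.x1, anchor.x_max) - max(bbox.x0, anchor.x_min)
    return max(0.0, overlap) / cell_width


# A cell spanning x=90..150 measured against two header columns
anchors = [ColumnAnchor(0, 0, 100), ColumnAnchor(1, 100, 200)]
cell = BBox(90, 40, 150, 60)
ratios = [calculate_x_overlap(cell, a) for a in anchors]
```

Under this definition the example cell overlaps the second header column by about 83% of its width, so it would clear the 0.5 threshold.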

**Why this approach:**
- Headers are typically the most accurately recognized row
- X-coordinates are objective measurements, not semantic inference
- Simple O(n*m) complexity (n cells, m columns)

### Decision 2: Vertical Fragment Merging

**Detection criteria for vertical text fragments:**
1. Width << Height (width/height aspect ratio < 0.3)
2. Located in the leftmost 15% of the table
3. X-center deviation < 10px between consecutive blocks
4. Y-gap < 20px (adjacent in vertical direction)

**Merge strategy:**
```python
from __future__ import annotations  # annotations are not evaluated at definition time

from typing import List

# TextBlock, BBox, should_merge and merge_group are assumed to be defined elsewhere.


def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
    """
    Merge vertically stacked narrow text blocks into single blocks.

    Blocks that are not merge candidates are returned unchanged.
    """
    # Filter candidates: narrow blocks in left margin
    left_boundary = table_bbox.x0 + (table_bbox.width * 0.15)

    def is_candidate(b: TextBlock) -> bool:
        return b.width < b.height * 0.3 and b.center_x < left_boundary

    candidates = [b for b in blocks if is_candidate(b)]
    # Keep the remaining blocks so they are not dropped from the result
    others = [b for b in blocks if not is_candidate(b)]

    # Sort by Y position
    candidates.sort(key=lambda b: b.y0)

    # Merge adjacent blocks
    merged = []
    current_group = []

    for block in candidates:
        if not current_group:
            current_group.append(block)
        elif should_merge(current_group[-1], block):
            current_group.append(block)
        else:
            merged.append(merge_group(current_group))
            current_group = [block]

    if current_group:
        merged.append(merge_group(current_group))

    return others + merged
```
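`should_merge` and `merge_group` are referenced by the merge strategy but not specified here. One possible reading of detection criteria 3 and 4 is sketched below. The `TextBlock` fields, the keyword parameters, and joining the text top to bottom without a separator are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Illustrative stand-in for the project's text block type."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center_x(self) -> float:
        return (self.x0 + self.x1) / 2


def should_merge(prev: TextBlock, curr: TextBlock,
                 max_x_dev: float = 10.0, max_y_gap: float = 20.0) -> bool:
    """Criteria 3 and 4: X-centers within 10px and vertical gap under 20px."""
    x_aligned = abs(prev.center_x - curr.center_x) < max_x_dev
    y_gap = curr.y0 - prev.y1  # negative when the boxes overlap vertically
    return x_aligned and y_gap < max_y_gap


def merge_group(group: List[TextBlock]) -> TextBlock:
    """Union the bounding boxes and concatenate text in the given order."""
    return TextBlock(
        text="".join(b.text for b in group),
        x0=min(b.x0 for b in group),
        y0=min(b.y0 for b in group),
        x1=max(b.x1 for b in group),
        y1=max(b.y1 for b in group),
    )


# Two stacked fragments that belong together, and one far below them
top = TextBlock("設", 10, 100, 30, 140)
middle = TextBlock("備", 11, 145, 31, 185)
far = TextBlock("註", 12, 300, 32, 340)
merged = merge_group([top, middle])
```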

### Decision 3: Data Sources

**Primary source:** `cell_boxes` from PP-Structure
- Contains accurate geometric coordinates for each detected cell
- Independent of HTML structure recognition

**Secondary source:** HTML content with row/col attributes
- Contains text content and structure
- May have incorrect col assignments (the problem we're fixing)

**Correlation:** Match HTML cells to cell_boxes using IoU (Intersection over Union):
```python
from __future__ import annotations  # annotations are not evaluated at definition time

from typing import List, Optional

# HtmlCell, BBox and calculate_iou are assumed to be defined elsewhere.


def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]:
    """Find the cell_box that best matches this HTML cell's position."""
    best_iou = 0.0
    best_box = None

    for box in cell_boxes:
        iou = calculate_iou(html_cell.inferred_bbox, box)
        if iou > best_iou:
            best_iou = iou
            best_box = box

    # Reject weak matches: require IoU above 0.3
    return best_box if best_iou > 0.3 else None
```
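The matching step depends on a `calculate_iou` helper that is not shown. A standard axis-aligned IoU can be sketched as follows. The `BBox` dataclass is an illustrative stand-in, and returning 0.0 for degenerate boxes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Illustrative stand-in for the project's bounding box type."""
    x0: float
    y0: float
    x1: float
    y1: float


def _area(box: BBox) -> float:
    return max(0.0, box.x1 - box.x0) * max(0.0, box.y1 - box.y0)


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not overlap
    intersection = inter_w * inter_h
    union = _area(a) + _area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes offset by half their width share one third of their union
iou_half_shift = calculate_iou(BBox(0, 0, 10, 10), BBox(5, 0, 15, 10))
```

In this example the half-shifted boxes score about 0.33, just above the 0.3 acceptance threshold used in `match_html_cell_to_cellbox`.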

## Configuration

```python
# config.py additions
table_column_correction_enabled: bool = Field(
    default=True,
    description="Enable header-anchor column correction"
)
table_column_correction_threshold: float = Field(
    default=0.5,
    description="Minimum X-overlap ratio to trigger column correction"
)
vertical_fragment_merge_enabled: bool = Field(
    default=True,
    description="Enable vertical text fragment merging"
)
vertical_fragment_aspect_ratio: float = Field(
    default=0.3,
    description="Max width/height ratio to consider as vertical text"
)
```
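As a rough illustration of how these flags might gate the two stages, the sketch below uses a plain dataclass as a stand-in for the project's settings object. `CorrectionSettings`, `plan_stages`, and the ordering (fragment merging before column correction) are assumptions, not part of the existing codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CorrectionSettings:
    """Hypothetical stand-in mirroring the config.py fields and defaults."""
    table_column_correction_enabled: bool = True
    table_column_correction_threshold: float = 0.5
    vertical_fragment_merge_enabled: bool = True
    vertical_fragment_aspect_ratio: float = 0.3


def plan_stages(settings: CorrectionSettings) -> List[str]:
    """List the correction stages to run, in the assumed order."""
    stages = []
    if settings.vertical_fragment_merge_enabled:
        stages.append("merge_vertical_fragments")
    if settings.table_column_correction_enabled:
        stages.append("correct_columns")
    return stages


default_plan = plan_stages(CorrectionSettings())
merge_only = plan_stages(CorrectionSettings(table_column_correction_enabled=False))
```

With both flags disabled the plan is empty, which leaves the original PP-Structure output untouched and keeps the pipeline backward compatible.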

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Headers themselves misaligned | Fall back to original column assignments |
| Multi-row headers | Support colspan detection in header extraction |
| Tables without headers | Skip correction, use original structure |
| Performance overhead | O(n*m) is negligible for typical table sizes |

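
The fallback in the first and third rows needs some test for whether the header anchors can be trusted. The design does not say how this is decided. The sketch below shows one hypothetical heuristic: require at least two anchors and reject anchor sets whose X ranges overlap by more than a small tolerance. The function name and the 2px tolerance are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnAnchor:
    """Illustrative stand-in matching ColumnAnchor(col_idx, x_min, x_max)."""
    col_idx: int
    x_min: float
    x_max: float


def anchors_are_usable(anchors: List[ColumnAnchor],
                       max_overlap_px: float = 2.0) -> bool:
    """Return False when correction should be skipped (hypothetical heuristic)."""
    if len(anchors) < 2:
        return False  # no header, or a single column: nothing to correct against
    ordered = sorted(anchors, key=lambda a: a.x_min)
    for left, right in zip(ordered, ordered[1:]):
        if left.x_max - right.x_min > max_overlap_px:
            return False  # neighboring header cells overlap: header is suspect
    return True


clean = [ColumnAnchor(0, 0, 100), ColumnAnchor(1, 100, 200)]
suspect = [ColumnAnchor(0, 0, 150), ColumnAnchor(1, 100, 200)]
```

A caller would keep the original column assignments whenever this check returns `False`.
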
## Integration Points

1. **Input:** PP-Structure's `table_res` containing:
   - `cell_boxes`: List of [x0, y0, x1, y1] coordinates
   - `html`: Table HTML with row/col attributes

2. **Output:** Corrected table structure with:
   - Updated col indices in HTML cells
   - Merged vertical text blocks
   - Diagnostic logs for corrections made

3. **Trigger location:** After PP-Structure table recognition, before PDF generation
   - File: `pdf_generator_service.py`
   - Method: `draw_table_region()` or new preprocessing step

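
To make the input shape concrete, the sketch below converts the raw `cell_boxes` lists into box objects before correction runs. It assumes `table_res` behaves like a dictionary with the two keys listed above. `BBox` and `parse_cell_boxes` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class BBox:
    """Illustrative stand-in for the project's bounding box type."""
    x0: float
    y0: float
    x1: float
    y1: float


def parse_cell_boxes(table_res: Dict[str, Any]) -> List[BBox]:
    """Convert [x0, y0, x1, y1] lists into BBox objects, skipping malformed entries."""
    boxes = []
    for raw in table_res.get("cell_boxes", []):
        if len(raw) != 4:
            continue  # not a 4-coordinate box
        x0, y0, x1, y1 = (float(v) for v in raw)
        boxes.append(BBox(x0, y0, x1, y1))
    return boxes


# Example payload in the assumed shape; the third entry is malformed on purpose
table_res = {
    "cell_boxes": [[0, 0, 50, 20], [50, 0, 120, 20], [1, 2, 3]],
    "html": "<table></table>",
}
boxes = parse_cell_boxes(table_res)
```
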
## Open Questions

1. **Q:** How should tables be handled when the header row itself is misaligned?
   **A:** A secondary validation using grid inference from `cell_boxes` could be added, but the first version should stay simple.

2. **Q:** Should corrections be logged for user review?
   **A:** Yes, add detailed logging with before/after column indices.

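
One way the grid inference from the first answer could work is to cluster the left edges of all `cell_boxes` and treat each cluster as a column start. This is only a sketch under assumptions: the function name, the 10px tolerance, and the use of left edges alone are illustrative choices, not a decided design.

```python
from typing import List, Sequence


def infer_column_starts(cell_boxes: Sequence[Sequence[float]],
                        tolerance: float = 10.0) -> List[float]:
    """Group cell left edges that lie within `tolerance` px of their neighbor.

    Returns the mean X position of each group, ordered left to right.
    """
    xs = sorted(box[0] for box in cell_boxes)
    clusters: List[List[float]] = []
    for x in xs:
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)  # close to the previous edge: same column
        else:
            clusters.append([x])    # gap too large: start a new column
    return [sum(c) / len(c) for c in clusters]


# Five cells whose left edges fall into three columns
cell_boxes = [
    [10, 0, 90, 20], [12, 20, 95, 40],
    [100, 0, 190, 20], [103, 20, 195, 40],
    [200, 0, 300, 20],
]
column_starts = infer_column_starts(cell_boxes)
```

Comparing the number of inferred columns with the number of header anchors would give a cheap signal that the header row is unreliable.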