OCR/openspec/changes/fix-table-column-alignment/design.md
2025-12-11 11:55:39 +08:00


Design: Table Column Alignment Correction

Context

PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for:

  • Tables with unclear left borders
  • Cells containing vertical Chinese text
  • Complex merged cells

This design introduces a post-processing correction layer that validates and fixes column assignments using geometric coordinates.

Goals / Non-Goals

Goals:

  • Correct column shift errors without modifying the PP-Structure model
  • Use header row as authoritative column reference
  • Merge fragmented vertical text into proper cells
  • Maintain backward compatibility with existing pipeline

Non-Goals:

  • Training new OCR/structure models
  • Modifying PP-Structure's internal behavior
  • Handling tables without clear headers (future enhancement)

Architecture

PP-Structure Output
        │
        ▼
┌───────────────────┐
│ Table Column      │
│ Corrector         │
│ (new middleware)  │
├───────────────────┤
│ 1. Extract header │
│    column ranges  │
│ 2. Validate cells │
│ 3. Correct col    │
│    assignments    │
└───────────────────┘
        │
        ▼
   PDF Generator

Decisions

Decision 1: Header-Anchor Algorithm

Approach: Use first row (row_idx=0) cells as column anchors.

Algorithm:

def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
    """
    Extract X-coordinate ranges from header row to define column boundaries.

    Returns:
        List of ColumnAnchor(col_idx, x_min, x_max)
    """
    anchors = []
    for cell in header_cells:
        anchors.append(ColumnAnchor(
            col_idx=cell.col_idx,
            x_min=cell.bbox.x0,
            x_max=cell.bbox.x1
        ))
    return sorted(anchors, key=lambda a: a.x_min)


def correct_column(cell: Cell, anchors: List[ColumnAnchor], threshold: float = 0.5) -> int:
    """
    Find the correct column index based on X-coordinate overlap.

    Strategy:
    1. Calculate overlap with each column anchor
    2. If overlap > threshold with a different column, correct it
    3. If no overlap with any anchor, use the nearest column by center point
    4. Otherwise keep the original assignment
    """
    if not anchors:
        return cell.col_idx  # no header reference: keep original

    cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2

    # Find the anchor with the largest X-overlap
    best_anchor = None
    best_overlap = 0.0

    for anchor in anchors:
        overlap = calculate_x_overlap(cell.bbox, anchor)
        if overlap > best_overlap:
            best_overlap = overlap
            best_anchor = anchor

    if best_anchor is None:
        # No overlap at all: fall back to the anchor whose center is nearest
        best_anchor = min(
            anchors,
            key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x),
        )
    elif best_overlap <= threshold:
        return cell.col_idx  # ambiguous overlap: keep original

    if best_anchor.col_idx != cell.col_idx:
        logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
    return best_anchor.col_idx

Why this approach:

  • Headers are typically the most accurately recognized row
  • X-coordinates are objective measurements, not semantic inference
  • Simple O(n*m) complexity (n cells, m columns)
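
The algorithm calls calculate_x_overlap without defining it. A minimal sketch follows, assuming the overlap is expressed as a fraction of the cell's own width (the design does not state the denominator). BBox and ColumnAnchor here are hypothetical stand-ins for the project's real types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical axis-aligned box; the real project type may differ."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ColumnAnchor:
    """Hypothetical anchor mirroring ColumnAnchor(col_idx, x_min, x_max)."""
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(bbox: BBox, anchor: ColumnAnchor) -> float:
    """Return the fraction of the cell's width that lies inside the anchor.

    Assumption: the ratio is normalised by the cell width, so 1.0 means the
    cell sits entirely within the header column's X range.
    """
    cell_width = bbox.x1 - bbox.x0
    if cell_width <= 0:
        return 0.0  # degenerate box: treat as no overlap
    shared = min(bbox.x1, anchor.x_max) - max(bbox.x0, anchor.x_min)
    return max(0.0, shared) / cell_width


# Worked example: a cell that geometrically sits under header column 1.
anchors = [ColumnAnchor(0, 0, 100), ColumnAnchor(1, 100, 200)]
cell = BBox(x0=110, y0=50, x1=190, y1=80)
overlaps = {a.col_idx: calculate_x_overlap(cell, a) for a in anchors}
# overlaps == {0: 0.0, 1: 1.0}
```

With these numbers, a cell that PP-Structure labelled as column 0 would exceed the 0.5 threshold only for column 1 and would be reassigned there.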

Decision 2: Vertical Fragment Merging

Detection criteria for vertical text fragments:

  1. Width much smaller than height (width/height ratio < 0.3)
  2. Located in leftmost 15% of table
  3. X-center deviation < 10px between consecutive blocks
  4. Y-gap < 20px (adjacent in vertical direction)

Merge strategy:

def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
    """
    Merge vertically stacked narrow text blocks into single blocks.

    Blocks that are not merge candidates are returned unchanged.
    """
    # Split blocks into candidates (narrow, in the left margin) and the rest
    left_boundary = table_bbox.x0 + (table_bbox.width * 0.15)
    candidates = []
    others = []
    for b in blocks:
        if b.width < b.height * 0.3 and b.center_x < left_boundary:
            candidates.append(b)
        else:
            others.append(b)

    # Sort by Y position
    candidates.sort(key=lambda b: b.y0)

    # Merge adjacent blocks
    merged = []
    current_group = []

    for block in candidates:
        if not current_group:
            current_group.append(block)
        elif should_merge(current_group[-1], block):
            current_group.append(block)
        else:
            merged.append(merge_group(current_group))
            current_group = [block]

    if current_group:
        merged.append(merge_group(current_group))

    # Keep the untouched blocks so no text is dropped from the table
    return others + merged
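
The helpers should_merge and merge_group are used above but not defined. The sketch below shows one way they might look, applying detection criteria 3 and 4 (10px X-center deviation, 20px Y-gap). The TextBlock fields and the two constants are assumptions made for illustration, not the project's actual definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Hypothetical text block; field names are assumptions."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def center_x(self) -> float:
        return (self.x0 + self.x1) / 2


MAX_X_CENTER_DEVIATION = 10.0  # px, criterion 3
MAX_Y_GAP = 20.0               # px, criterion 4


def should_merge(prev: TextBlock, curr: TextBlock) -> bool:
    """True when curr continues the vertical run that ends at prev."""
    aligned = abs(prev.center_x - curr.center_x) < MAX_X_CENTER_DEVIATION
    # A negative gap (overlapping fragments) also counts as adjacent.
    adjacent = (curr.y0 - prev.y1) < MAX_Y_GAP
    return aligned and adjacent


def merge_group(group: List[TextBlock]) -> TextBlock:
    """Combine a run of blocks into one block spanning their union box."""
    return TextBlock(
        x0=min(b.x0 for b in group),
        y0=min(b.y0 for b in group),
        x1=max(b.x1 for b in group),
        y1=max(b.y1 for b in group),
        # Blocks arrive sorted top to bottom, so plain concatenation
        # keeps the reading order of vertical text.
        text="".join(b.text for b in group),
    )


top = TextBlock(10, 100, 30, 160, "注")
bottom = TextBlock(12, 170, 32, 230, "意")
far = TextBlock(10, 400, 30, 460, "表")
joined = merge_group([top, bottom])
```

Here top and bottom are 10px apart vertically and their centers differ by 2px, so they merge; far starts 170px below bottom and begins a new group.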

Decision 3: Data Sources

Primary source: cell_boxes from PP-Structure

  • Contains accurate geometric coordinates for each detected cell
  • Independent of HTML structure recognition

Secondary source: HTML content with row/col attributes

  • Contains text content and structure
  • May have incorrect col assignments (the problem we're fixing)

Correlation: Match HTML cells to cell_boxes using IoU (Intersection over Union):

def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]:
    """Find the cell_box that best matches this HTML cell's position."""
    best_iou = 0
    best_box = None

    for box in cell_boxes:
        iou = calculate_iou(html_cell.inferred_bbox, box)
        if iou > best_iou:
            best_iou = iou
            best_box = box

    return best_box if best_iou > 0.3 else None
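
The matching step relies on calculate_iou, which the design does not spell out. A minimal sketch of the standard IoU computation for axis-aligned boxes follows; the BBox tuple is a hypothetical stand-in for whatever box type the pipeline uses.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical box in [x0, y0, x1, y1] order, as in cell_boxes."""
    x0: float
    y0: float
    x1: float
    y1: float


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes are disjoint or only touch at an edge

    intersection = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    return intersection / (area_a + area_b - intersection)


# Two 10x10 boxes shifted by half a width share 50 of 150 units of area.
half_shifted = calculate_iou(BBox(0, 0, 10, 10), BBox(5, 0, 15, 10))
```

In this example the IoU is 1/3, which just clears the 0.3 acceptance threshold used by match_html_cell_to_cellbox.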

Configuration

# config.py additions
table_column_correction_enabled: bool = Field(
    default=True,
    description="Enable header-anchor column correction"
)
table_column_correction_threshold: float = Field(
    default=0.5,
    description="Minimum X-overlap ratio to trigger column correction"
)
vertical_fragment_merge_enabled: bool = Field(
    default=True,
    description="Enable vertical text fragment merging"
)
vertical_fragment_aspect_ratio: float = Field(
    default=0.3,
    description="Max width/height ratio to consider as vertical text"
)

Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Headers themselves misaligned | Fall back to original column assignments |
| Multi-row headers | Support colspan detection in header extraction |
| Tables without headers | Skip correction, use original structure |
| Performance overhead | O(n*m) is negligible for typical table sizes |
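
Falling back to the original assignments requires some way to decide that the header anchors are unreliable. The design does not define that check. One possible heuristic, offered as an assumption, is to require that anchors read left to right have increasing column indices and do not overlap each other. The function name and the tolerance parameter below are hypothetical.

```python
from typing import List, NamedTuple


class ColumnAnchor(NamedTuple):
    """Hypothetical anchor mirroring ColumnAnchor(col_idx, x_min, x_max)."""
    col_idx: int
    x_min: float
    x_max: float


def anchors_look_reliable(anchors: List[ColumnAnchor], tolerance: float = 2.0) -> bool:
    """Heuristic sanity check on header anchors.

    Returns False when there are no anchors, when column indices do not
    increase from left to right, or when neighbouring X ranges overlap by
    more than `tolerance` pixels.
    """
    if not anchors:
        return False  # no header row: correction should be skipped
    ordered = sorted(anchors, key=lambda a: a.x_min)
    for left, right in zip(ordered, ordered[1:]):
        if right.col_idx <= left.col_idx:
            return False  # indices out of order or duplicated
        if left.x_max - right.x_min > tolerance:
            return False  # header cells overlap horizontally
    return True


good = [ColumnAnchor(0, 0, 100), ColumnAnchor(1, 100, 200)]
swapped = [ColumnAnchor(1, 0, 100), ColumnAnchor(0, 100, 200)]
overlapping = [ColumnAnchor(0, 0, 150), ColumnAnchor(1, 100, 200)]
```

A caller could run this check once per table and leave the PP-Structure output untouched whenever it returns False.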

Integration Points

  1. Input: PP-Structure's table_res containing:

    • cell_boxes: List of [x0, y0, x1, y1] coordinates
    • html: Table HTML with row/col attributes
  2. Output: Corrected table structure with:

    • Updated col indices in HTML cells
    • Merged vertical text blocks
    • Diagnostic logs for corrections made
  3. Trigger location: After PP-Structure table recognition, before PDF generation

    • File: pdf_generator_service.py
    • Method: draw_table_region() or new preprocessing step

Open Questions

  1. Q: How to handle tables where header row itself is misaligned? A: Could add a secondary validation using cell_boxes grid inference, but start simple.

  2. Q: Should corrections be logged for user review? A: Yes, add detailed logging with before/after column indices.
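
The first open question mentions grid inference from cell_boxes as a possible secondary validation. The design leaves the method open. As one illustrative option, the left edges of all cell boxes can be clustered with a pixel tolerance so that each cluster marks the start of a column. The function name and the tolerance default are assumptions.

```python
from typing import List, Sequence


def infer_column_starts(
    cell_boxes: Sequence[Sequence[float]], tolerance: float = 10.0
) -> List[float]:
    """Cluster the left edges (x0) of cell boxes into column start positions.

    Each box is [x0, y0, x1, y1]. Sorted edges that lie within `tolerance`
    pixels of the previous edge join the same cluster; the mean of each
    cluster is reported as that column's start.
    """
    xs = sorted(box[0] for box in cell_boxes)
    clusters: List[List[float]] = []
    for x in xs:
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


# Two rows of two cells each, with slightly jittered left edges.
boxes = [
    [0, 0, 100, 20], [102, 0, 200, 20],
    [1, 20, 100, 40], [98, 20, 200, 40],
]
starts = infer_column_starts(boxes)
# starts == [0.5, 100.0]
```

Comparing these inferred starts against the header anchors' x_min values would flag tables whose header row disagrees with the body. Because each edge is compared only to the previous one, a long chain of closely spaced edges can drift into a single cluster, so the tolerance should stay well below the narrowest expected column width.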