# Design: Table Column Alignment Correction

## Context

PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for:

- Tables with unclear left borders
- Cells containing vertical Chinese text
- Complex merged cells

This design introduces a **post-processing correction layer** that validates and fixes column assignments using geometric coordinates.

## Goals / Non-Goals

**Goals:**
- Correct column shift errors without modifying the PP-Structure model
- Use the header row as the authoritative column reference
- Merge fragmented vertical text into proper cells
- Maintain backward compatibility with the existing pipeline

**Non-Goals:**
- Training new OCR/structure models
- Modifying PP-Structure's internal behavior
- Handling tables without clear headers (future enhancement)

## Architecture

```
PP-Structure Output
        │
        ▼
┌───────────────────┐
│ Table Column      │
│ Corrector         │
│ (new middleware)  │
├───────────────────┤
│ 1. Extract header │
│    column ranges  │
│ 2. Validate cells │
│ 3. Correct col    │
│    assignments    │
└───────────────────┘
        │
        ▼
   PDF Generator
```

## Decisions

### Decision 1: Header-Anchor Algorithm

**Approach:** Use the cells of the first row (`row_idx=0`) as column anchors.

**Algorithm:**

```python
import logging
from typing import List

logger = logging.getLogger(__name__)

# Cell, ColumnAnchor, and calculate_x_overlap are assumed to be defined
# elsewhere in the pipeline.


def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
    """
    Extract X-coordinate ranges from header row to define column boundaries.

    Returns:
        List of ColumnAnchor(col_idx, x_min, x_max)
    """
    anchors = []
    for cell in header_cells:
        anchors.append(ColumnAnchor(
            col_idx=cell.col_idx,
            x_min=cell.bbox.x0,
            x_max=cell.bbox.x1
        ))
    return sorted(anchors, key=lambda a: a.x_min)


def correct_column(cell: Cell, anchors: List[ColumnAnchor]) -> int:
    """
    Find the correct column index based on X-coordinate overlap.

    Strategy:
    1. Calculate overlap with each column anchor
    2. If overlap > 50% with different column, correct it
    3. If no overlap, find nearest column by center point
    """
    cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2

    # Find best matching anchor
    best_anchor = None
    best_overlap = 0
    for anchor in anchors:
        overlap = calculate_x_overlap(cell.bbox, anchor)
        if overlap > best_overlap:
            best_overlap = overlap
            best_anchor = anchor

    if best_anchor is None:
        # Step 3: no overlap with any anchor, so pick the nearest anchor center
        if not anchors:
            return cell.col_idx
        best_anchor = min(
            anchors,
            key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x),
        )
    elif best_overlap <= 0.5:
        # Overlap too small to justify a correction
        return cell.col_idx

    # Significant overlap (or nearest fallback) with a different column: correct
    if best_anchor.col_idx != cell.col_idx:
        logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
    return best_anchor.col_idx
```

**Why this approach:**
- Headers are typically the most accurately recognized row
- X-coordinates are objective measurements, not semantic inference
- Simple, with O(n*m) complexity (n cells, m columns)

### Decision 2: Vertical Fragment Merging

**Detection criteria for vertical text fragments:**
1. Width much smaller than height (width/height aspect ratio < 0.3)
2. Located in the leftmost 15% of the table
3. X-center deviation < 10px between consecutive blocks
4. Y-gap < 20px (adjacent in the vertical direction)

**Merge strategy:**

```python
# TextBlock, BBox, should_merge, and merge_group are assumed to be defined
# elsewhere in the pipeline.
def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
    """
    Merge vertically stacked narrow text blocks into single blocks.

    Note: only the merged left-margin candidates are returned; blocks that
    do not match the candidate filter are not part of the result.
    """
    # Filter candidates: narrow blocks in left margin
    left_boundary = table_bbox.x0 + (table_bbox.width * 0.15)
    candidates = [b for b in blocks
                  if b.width < b.height * 0.3
                  and b.center_x < left_boundary]

    # Sort by Y position
    candidates.sort(key=lambda b: b.y0)

    # Merge adjacent blocks
    merged = []
    current_group = []
    for block in candidates:
        if not current_group:
            current_group.append(block)
        elif should_merge(current_group[-1], block):
            current_group.append(block)
        else:
            merged.append(merge_group(current_group))
            current_group = [block]

    # Flush the last open group
    if current_group:
        merged.append(merge_group(current_group))

    return merged
```

### Decision 3: Data Sources

**Primary source:** `cell_boxes` from PP-Structure
- Contains accurate geometric coordinates for each detected cell
- Independent of HTML structure recognition

**Secondary source:** HTML content with row/col attributes
- Contains text content and structure
- May have incorrect col assignments (the problem we're fixing)

**Correlation:** Match HTML cells to cell_boxes using IoU (Intersection over Union):

```python
from typing import List, Optional


def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]:
    """Find the cell_box that best matches this HTML cell's position."""
    best_iou = 0
    best_box = None
    for box in cell_boxes:
        iou = calculate_iou(html_cell.inferred_bbox, box)
        if iou > best_iou:
            best_iou = iou
            best_box = box
    # Reject weak matches: require IoU above 0.3
    return best_box if best_iou > 0.3 else None
```

## Configuration

```python
# config.py additions
table_column_correction_enabled: bool = Field(
    default=True,
    description="Enable header-anchor column correction"
)
table_column_correction_threshold: float = Field(
    default=0.5,
    description="Minimum X-overlap ratio to trigger column correction"
)
vertical_fragment_merge_enabled: bool = Field(
    default=True,
    description="Enable vertical text fragment merging"
)
vertical_fragment_aspect_ratio: float = Field(
    default=0.3,
    description="Max width/height ratio to consider as vertical text"
)
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Headers themselves misaligned | Fall back to original column assignments |
| Multi-row headers | Support colspan detection in header extraction |
| Tables without headers | Skip correction, use original structure |
| Performance overhead | O(n*m) is negligible for typical table sizes |

## Integration Points

1. **Input:** PP-Structure's `table_res` containing:
   - `cell_boxes`: List of [x0, y0, x1, y1] coordinates
   - `html`: Table HTML with row/col attributes

2. **Output:** Corrected table structure with:
   - Updated col indices in HTML cells
   - Merged vertical text blocks
   - Diagnostic logs for corrections made

3. **Trigger location:** After PP-Structure table recognition, before PDF generation
   - File: `pdf_generator_service.py`
   - Method: `draw_table_region()` or a new preprocessing step

## Open Questions

1. **Q:** How should tables be handled when the header row itself is misaligned?
   **A:** We could add a secondary validation using cell_boxes grid inference, but start simple.

2. **Q:** Should corrections be logged for user review?
   **A:** Yes, add detailed logging with before/after column indices.
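
For the first open question, one possible shape of the "cell_boxes grid inference" check is sketched below. It clusters the left edges of all `cell_boxes` into column start positions and then tests whether the header row lines up with them. The function names `infer_column_starts` and `header_matches_grid` and the 10px `tolerance` default are hypothetical, chosen for illustration only; they are not part of PP-Structure or the existing pipeline.

```python
from typing import List, Sequence


def infer_column_starts(
    cell_boxes: Sequence[Sequence[float]], tolerance: float = 10.0
) -> List[float]:
    """Cluster the left edges (x0) of all cell boxes into column starts.

    Each box is [x0, y0, x1, y1]. An edge joins the current cluster when it
    lies within `tolerance` pixels of that cluster's running mean.
    Returns the mean x0 of each cluster, ordered left to right.
    """
    xs = sorted(box[0] for box in cell_boxes)
    clusters: List[List[float]] = []
    for x in xs:
        if clusters and x - sum(clusters[-1]) / len(clusters[-1]) <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


def header_matches_grid(
    header_x0s: Sequence[float],
    column_starts: Sequence[float],
    tolerance: float = 10.0,
) -> bool:
    """Return True when the header has one cell per inferred column and
    every header left edge is within `tolerance` of its column start."""
    if len(header_x0s) != len(column_starts):
        return False
    return all(
        abs(h - c) <= tolerance
        for h, c in zip(sorted(header_x0s), column_starts)
    )


# Two rows of three cells with slightly jittered left edges
boxes = [
    [0, 0, 50, 10], [52, 0, 100, 10], [101, 0, 150, 10],
    [2, 10, 50, 20], [50, 10, 100, 20], [99, 10, 150, 20],
]
starts = infer_column_starts(boxes)
header_ok = header_matches_grid([0, 52, 101], starts)
```

If `header_matches_grid` returns False, the corrector could skip header-anchor correction and keep the original column assignments, which matches the fallback listed under Risks. A header with colspan cells would also fail the count check in this sketch, so it would need separate handling.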
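
Decision 1 and Decision 3 call `calculate_x_overlap` and `calculate_iou` without defining them. The sketch below shows one way they might look. It assumes that the overlap ratio is normalized by the cell's own width (the design does not say whether the 50% threshold is relative to the cell or to the anchor), and it uses stand-in `BBox` and `ColumnAnchor` dataclasses in place of the pipeline's real types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box; a stand-in for the pipeline's own BBox type."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass(frozen=True)
class ColumnAnchor:
    """Stand-in matching ColumnAnchor(col_idx, x_min, x_max) in Decision 1."""
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(bbox: BBox, anchor: ColumnAnchor) -> float:
    """Fraction of the cell's width that falls inside the anchor's X range.

    Assumption: the ratio is relative to the cell width, not the anchor width.
    """
    cell_width = bbox.x1 - bbox.x0
    if cell_width <= 0:
        return 0.0
    shared = min(bbox.x1, anchor.x_max) - max(bbox.x0, anchor.x_min)
    return max(0.0, shared) / cell_width


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection over Union of two axis-aligned boxes, in [0, 1]."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = (a.x1 - a.x0) * (a.y1 - a.y0) + (b.x1 - b.x0) * (b.y1 - b.y0) - inter
    return inter / union if union > 0 else 0.0


# A cell spanning x=10..30 against a header anchor spanning x=0..20
half = calculate_x_overlap(BBox(10, 0, 30, 10), ColumnAnchor(0, 0, 20))
third = calculate_iou(BBox(0, 0, 10, 10), BBox(5, 0, 15, 10))
```

Normalizing by the cell width means a narrow cell fully inside a wide header column scores 1.0, while a cell straddling two columns scores the share of its own width in each. If the ratio were normalized by anchor width instead, narrow cells would rarely reach the 0.5 threshold.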