# Design: cell_boxes-First Table Rendering

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                    Table Rendering Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input: table_element                                           │
│    ├── cell_boxes: [[x0,y0,x1,y1], ...]  (from PP-StructureV3)  │
│    ├── html: "..."                       (from PP-StructureV3)  │
│    └── bbox: [x0, y0, x1, y1]            (table boundary)       │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Step 1: Grid Inference from cell_boxes                   │  │
│  │                                                           │  │
│  │  cell_boxes → cluster by Y → rows                         │  │
│  │             → cluster by X → cols                         │  │
│  │             → build grid[row][col] = cell_bbox            │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Step 2: Content Extraction from HTML                     │  │
│  │                                                           │  │
│  │  html → parse → extract text list in reading order        │  │
│  │       → flatten colspan/rowspan → [text1, text2, ...]     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Step 3: Content-to-Cell Mapping                          │  │
│  │                                                           │  │
│  │  Option A: Sequential assignment (text[i] → cell[i])      │  │
│  │  Option B: Coordinate matching (text_bbox ∩ cell_bbox)    │  │
│  │  Option C: Row-by-row assignment                          │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                │                                │
│                                ▼                                │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Step 4: PDF Rendering                                    │  │
│  │                                                           │  │
│  │  For each cell in grid:                                   │  │
│  │    1. Draw cell border at cell_bbox coordinates           │  │
│  │    2. Render text content inside cell                     │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  Output: Table rendered in PDF with accurate cell boundaries    │
└─────────────────────────────────────────────────────────────────┘
```

## Detailed Design

### 1. Grid Inference Algorithm

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)


def infer_grid_from_cellboxes(
    cell_boxes: List[List[float]],
    row_threshold: float = 15.0,
    col_threshold: float = 15.0,
):
    """
    Infer row/column grid structure from cell_boxes coordinates.

    Args:
        cell_boxes: List of [x0, y0, x1, y1] coordinates
        row_threshold: Clustering threshold for grouping Y-centers into rows
        col_threshold: Clustering threshold for grouping X-centers into columns

    Returns:
        grid: Dict[Tuple[int, int], Dict] mapping (row, col) to cell info
        row_heights: List of row heights, one per row
        col_widths: List of column widths, one per column
    """
    if not cell_boxes:
        return {}, [], []

    # 1. Extract all Y-centers and X-centers
    y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes]
    x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes]

    # 2. Cluster Y-centers into rows
    rows = cluster_values(y_centers, row_threshold)  # Sorted row center coordinates

    # 3. Cluster X-centers into columns
    cols = cluster_values(x_centers, col_threshold)  # Sorted column center coordinates

    # 4. Assign each cell_box to (row, col)
    grid: Dict[Tuple[int, int], Dict] = {}
    for i, cb in enumerate(cell_boxes):
        row = find_cluster(y_centers[i], rows)
        col = find_cluster(x_centers[i], cols)
        if (row, col) in grid:
            # Overlapping cell_boxes: keep the first one (see Edge Cases)
            logger.warning("cell_box %d overlaps grid position (%d, %d)", i, row, col)
            continue
        grid[(row, col)] = {'bbox': cb, 'index': i}

    # 5. Calculate heights/widths from the cell boundaries in each row/column.
    #    Differences between cluster centers would yield one entry too few.
    row_heights = []
    for r in range(len(rows)):
        boxes = [c['bbox'] for (gr, _), c in grid.items() if gr == r]
        row_heights.append(max(b[3] for b in boxes) - min(b[1] for b in boxes) if boxes else 0.0)
    col_widths = []
    for c_idx in range(len(cols)):
        boxes = [c['bbox'] for (_, gc), c in grid.items() if gc == c_idx]
        col_widths.append(max(b[2] for b in boxes) - min(b[0] for b in boxes) if boxes else 0.0)

    return grid, row_heights, col_widths
```

### 2. Content Extraction

The HTML content extraction should handle colspan/rowspan by flattening. The code below expands colspan only; rowspan expansion is not yet covered:

```python
from typing import List


def extract_cell_contents(html: str) -> List[str]:
    """
    Extract cell text contents from HTML in reading order.
    Expands colspan into repeated empty strings (rowspan is not expanded).

    Returns:
        List of text strings, one per logical cell position
    """
    parser = HTMLTableParser()
    parser.feed(html)

    # No table found in the HTML: nothing to map
    if not parser.tables:
        return []

    contents = []
    for row in parser.tables[0]['rows']:
        for cell in row['cells']:
            contents.append(cell['text'])
            # For colspan > 1, add empty strings for merged cells
            for _ in range(cell.get('colspan', 1) - 1):
                contents.append('')

    return contents
```

### 3. Content-to-Cell Mapping Strategy

**Recommended: Row-by-row Sequential Assignment**

Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order:

```python
import logging

logger = logging.getLogger(__name__)


def map_content_to_grid(grid, contents, num_rows, num_cols):
    """
    Map extracted content to grid cells row by row.
    """
    # Content count mismatch: fill what fits and log a warning (see Edge Cases)
    if len(contents) != len(grid):
        logger.warning("content count %d != grid cell count %d", len(contents), len(grid))

    content_idx = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) in grid:
                if content_idx < len(contents):
                    grid[(row, col)]['content'] = contents[content_idx]
                    content_idx += 1
                else:
                    grid[(row, col)]['content'] = ''

    return grid
```

### 4. PDF Rendering Integration

Modify `pdf_generator_service.py` to use the cell_boxes-first path:

```python
def draw_table_region(self, ...):
    cell_boxes = table_element.get('cell_boxes', [])
    html_content = table_element.get('content', '')

    if cell_boxes and settings.table_rendering_prefer_cellboxes:
        # Try cell_boxes-first approach, using the configured thresholds
        grid, row_heights, col_widths = infer_grid_from_cellboxes(
            cell_boxes,
            row_threshold=settings.table_cellboxes_row_threshold,
            col_threshold=settings.table_cellboxes_col_threshold,
        )

        if grid:
            # Extract content from HTML
            contents = extract_cell_contents(html_content)

            # Map content to grid
            grid = map_content_to_grid(grid, contents, len(row_heights), len(col_widths))

            # Render using cell_boxes coordinates
            success = self._render_table_from_grid(
                pdf_canvas, grid, row_heights, col_widths,
                page_height, scale_w, scale_h
            )

            if success:
                return  # Done

    # Fallback to existing HTML-based rendering
    self._render_table_from_html(...)
```

## Configuration

```python
# config.py
class Settings:
    # Table rendering strategy
    table_rendering_prefer_cellboxes: bool = Field(
        default=True,
        description="Use cell_boxes coordinates as primary table structure source"
    )

    table_cellboxes_row_threshold: float = Field(
        default=15.0,
        description="Y-coordinate threshold for row clustering"
    )

    table_cellboxes_col_threshold: float = Field(
        default=15.0,
        description="X-coordinate threshold for column clustering"
    )
```

## Edge Cases

### 1. Empty cell_boxes

- **Condition**: `cell_boxes` is empty or None
- **Action**: Fall back to HTML-based rendering

### 2. Content Count Mismatch

- **Condition**: HTML has more/fewer cells than cell_boxes grid
- **Action**: Fill available cells, leave extras empty, log warning

### 3. Overlapping cell_boxes

- **Condition**: Multiple cell_boxes map to same grid position
- **Action**: Use first one, log warning

### 4. Single-cell Tables

- **Condition**: Only 1 cell_box detected
- **Action**: Render as single-cell table (valid case)

## Testing Plan

1. **Unit Tests**
   - `test_infer_grid_from_cellboxes`: Various cell_box configurations
   - `test_content_mapping`: Content assignment scenarios

2. **Integration Tests**
   - `test_scan_pdf_table_7`: Verify the problematic table renders correctly
   - `test_existing_tables`: No regression on previously working tables

3. **Visual Verification**
   - Compare PDF output before/after for `scan.pdf`
   - Check table alignment and text placement
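
The unit tests for `infer_grid_from_cellboxes` depend on `cluster_values` and `find_cluster`, which the grid inference step calls but this design does not define. The following is a minimal sketch of what they might look like, assuming simple gap-based 1-D clustering (a new cluster starts when the gap to the previous sorted value exceeds the threshold) and nearest-center lookup. Both function bodies and the sample `cell_boxes` are illustrative assumptions, not the project's actual implementation.

```python
from typing import List


def cluster_values(values: List[float], threshold: float) -> List[float]:
    """Group 1-D values into clusters and return the sorted cluster centers.

    Assumed rule: after sorting, a value starts a new cluster when its gap
    to the previous value exceeds `threshold`.
    """
    if not values:
        return []
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > threshold:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    # Each cluster is represented by the mean of its members
    return [sum(c) / len(c) for c in clusters]


def find_cluster(value: float, centers: List[float]) -> int:
    """Return the index of the cluster center nearest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


# Hypothetical 2x2 table whose cell boundaries are slightly jittered
cell_boxes = [
    [0, 0, 100, 30], [100, 1, 200, 31],
    [0, 30, 100, 60], [101, 31, 200, 61],
]
y_centers = [(b[1] + b[3]) / 2 for b in cell_boxes]
x_centers = [(b[0] + b[2]) / 2 for b in cell_boxes]

rows = cluster_values(y_centers, threshold=15.0)
cols = cluster_values(x_centers, threshold=15.0)
positions = [
    (find_cluster(y, rows), find_cluster(x, cols))
    for y, x in zip(y_centers, x_centers)
]
print(rows, cols, positions)
```

Gap-based clustering keeps the helpers dependency-free and deterministic, which makes fixtures for `test_infer_grid_from_cellboxes` easy to write. One limitation to cover in the tests: a chain of closely spaced centers can merge into a single cluster even when its first and last members are far apart.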