egg/OCR

egg cfe65158a3 feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 17:13:46 +08:00

10 KiB

Design: cell_boxes-First Table Rendering

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    Table Rendering Pipeline                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Input: table_element                                            │
│    ├── cell_boxes: [[x0,y0,x1,y1], ...]   (from PP-StructureV3)│
│    ├── html: "<table>...</table>"          (from PP-StructureV3)│
│    └── bbox: [x0, y0, x1, y1]              (table boundary)      │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            Step 1: Grid Inference from cell_boxes          │ │
│  │                                                             │ │
│  │  cell_boxes → cluster by Y → rows                          │ │
│  │            → cluster by X → cols                           │ │
│  │            → build grid[row][col] = cell_bbox              │ │
│  └────────────────────────────────────────────────────────────┘ │
│                          │                                       │
│                          ▼                                       │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            Step 2: Content Extraction from HTML            │ │
│  │                                                             │ │
│  │  html → parse → extract text list in reading order         │ │
│  │       → flatten colspan/rowspan → [text1, text2, ...]      │ │
│  └────────────────────────────────────────────────────────────┘ │
│                          │                                       │
│                          ▼                                       │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            Step 3: Content-to-Cell Mapping                 │ │
│  │                                                             │ │
│  │  Option A: Sequential assignment (text[i] → cell[i])       │ │
│  │  Option B: Coordinate matching (text_bbox ∩ cell_bbox)     │ │
│  │  Option C: Row-by-row assignment                           │ │
│  └────────────────────────────────────────────────────────────┘ │
│                          │                                       │
│                          ▼                                       │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │            Step 4: PDF Rendering                           │ │
│  │                                                             │ │
│  │  For each cell in grid:                                    │ │
│  │    1. Draw cell border at cell_bbox coordinates            │ │
│  │    2. Render text content inside cell                      │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                  │
│  Output: Table rendered in PDF with accurate cell boundaries     │
└─────────────────────────────────────────────────────────────────┘

Detailed Design

1. Grid Inference Algorithm

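import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)

def infer_grid_from_cellboxes(
    cell_boxes: List[List[float]], threshold: float = 15.0
) -> Tuple[Dict[Tuple[int, int], Dict], List[float], List[float]]:
    """
    Infer row/column grid structure from cell_boxes coordinates.

    Args:
        cell_boxes: List of [x0, y0, x1, y1] coordinates
        threshold: Clustering threshold for row/column grouping

    Returns:
        grid: Dict[Tuple[int,int], Dict] mapping (row, col) to cell info
        row_heights: List of row heights, one entry per row
        col_widths: List of column widths, one entry per column
    """
    # Empty or missing cell_boxes: the caller falls back to HTML rendering
    if not cell_boxes:
        return {}, [], []

    # 1. Extract all Y-centers and X-centers
    y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes]
    x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes]

    # 2. Cluster Y-centers into rows
    rows = cluster_values(y_centers, threshold)  # Assumed to return sorted cluster centers, one per row

    # 3. Cluster X-centers into columns
    cols = cluster_values(x_centers, threshold)  # Assumed to return sorted cluster centers, one per column

    # 4. Assign each cell_box to (row, col); on a collision keep the first box
    grid = {}
    for i, cb in enumerate(cell_boxes):
        row = find_cluster(y_centers[i], rows)
        col = find_cluster(x_centers[i], cols)
        if (row, col) in grid:
            logger.warning("cell_box %d overlaps grid position (%d, %d); keeping the first", i, row, col)
            continue
        grid[(row, col)] = {
            'bbox': cb,
            'index': i
        }

    # 5. Calculate heights/widths from the extents of the cells in each row/column.
    #    Differences between cluster centers would yield one entry too few.
    #    Note: a merged cell inflates the size of the row/column it is assigned to.
    row_heights = [0.0] * len(rows)
    col_widths = [0.0] * len(cols)
    for (row, col), cell in grid.items():
        x0, y0, x1, y1 = cell['bbox']
        row_heights[row] = max(row_heights[row], y1 - y0)
        col_widths[col] = max(col_widths[col], x1 - x0)

    return grid, row_heights, col_widths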
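
The grid inference above calls `cluster_values` and `find_cluster` without defining them. A minimal sketch of what they might look like follows, assuming `cluster_values` returns sorted cluster centers and `find_cluster` returns the index of the nearest center. Both bodies are hypothetical and not taken from the project; the gap-based grouping is one simple choice among several.

```python
from typing import List


def cluster_values(values: List[float], threshold: float) -> List[float]:
    """Group 1-D values into clusters and return the sorted cluster centers.

    Values are visited in ascending order. A value joins the current cluster
    when it lies within `threshold` of the previous value; otherwise it
    starts a new cluster. Each center is the mean of its cluster.
    """
    centers: List[float] = []
    current: List[float] = []
    for v in sorted(values):
        if current and v - current[-1] > threshold:
            centers.append(sum(current) / len(current))
            current = []
        current.append(v)
    if current:
        centers.append(sum(current) / len(current))
    return centers


def find_cluster(value: float, centers: List[float]) -> int:
    """Return the index of the cluster center nearest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


# Five Y-centers that fall into three visual rows
row_centers = cluster_values([10.0, 12.0, 50.0, 52.0, 90.0], threshold=15.0)
print(row_centers)                      # [11.0, 51.0, 90.0]
print(find_cluster(49.0, row_centers))  # 1
```

Because grouping chains on the gap to the previous value, a long run of closely spaced centers can merge into one cluster even when its ends are far apart, so the threshold should stay well below the smallest expected row height or column width.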

2. Content Extraction

The HTML content extraction should flatten colspan/rowspan so that each logical cell position gets one entry. The version below expands colspan only; rowspan still needs to be handled:

from typing import List

def extract_cell_contents(html: str) -> List[str]:
    """
    Extract cell text contents from HTML in reading order.
    Expands colspan into trailing empty strings; rowspan is not expanded yet.

    Returns:
        List of text strings, one per logical cell position in each row
    """
    parser = HTMLTableParser()
    parser.feed(html)

    # No <table> found in the HTML: nothing to extract
    if not parser.tables:
        return []

    contents = []
    for row in parser.tables[0]['rows']:
        for cell in row['cells']:
            contents.append(cell['text'])
            # For colspan > 1, add empty strings for merged cells
            for _ in range(cell.get('colspan', 1) - 1):
                contents.append('')

    return contents
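
`HTMLTableParser` is used above but not shown. As a rough illustration of the shape `extract_cell_contents` relies on (`tables[0]['rows'][i]['cells'][j]` with `text` and `colspan` keys), here is a hypothetical stand-in built on the standard library's `html.parser`. The project's real parser may differ; this sketch ignores nested tables and assumes span attributes are numeric.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List, Optional


class HTMLTableParser(HTMLParser):
    """Collect tables as {'rows': [{'cells': [{'text', 'colspan', 'rowspan'}]}]}."""

    def __init__(self) -> None:
        super().__init__()
        self.tables: List[Dict[str, Any]] = []
        self._cell: Optional[Dict[str, Any]] = None  # cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append({'rows': []})
        elif tag == 'tr' and self.tables:
            self.tables[-1]['rows'].append({'cells': []})
        elif tag in ('td', 'th') and self.tables and self.tables[-1]['rows']:
            a = dict(attrs)
            self._cell = {
                'text': '',
                # A missing or valueless span attribute defaults to 1
                'colspan': int(a.get('colspan') or 1),
                'rowspan': int(a.get('rowspan') or 1),
            }

    def handle_data(self, data):
        # Accumulate text only while inside a <td>/<th>
        if self._cell is not None:
            self._cell['text'] += data

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._cell['text'] = self._cell['text'].strip()
            self.tables[-1]['rows'][-1]['cells'].append(self._cell)
            self._cell = None


parser = HTMLTableParser()
parser.feed('<table><tr><td colspan="2">A</td><td> B </td></tr></table>')
print(parser.tables[0]['rows'][0]['cells'][0])
# {'text': 'A', 'colspan': 2, 'rowspan': 1}
```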

3. Content-to-Cell Mapping Strategy

Recommended: Row-by-row Sequential Assignment

Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order:

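import logging

def map_content_to_grid(grid, contents, num_rows, num_cols):
    """
    Map extracted content to grid cells row by row.
    Cells beyond the end of contents are left empty; a count mismatch is logged.
    """
    content_idx = 0
    empty_cells = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) in grid:
                if content_idx < len(contents):
                    grid[(row, col)]['content'] = contents[content_idx]
                    content_idx += 1
                else:
                    grid[(row, col)]['content'] = ''
                    empty_cells += 1

    # Edge case "Content Count Mismatch": report leftovers on either side
    unused = len(contents) - content_idx
    if unused or empty_cells:
        logging.getLogger(__name__).warning(
            "Content count mismatch: %d unused HTML cells, %d grid cells left empty",
            unused, empty_cells)

    return grid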

4. PDF Rendering Integration

Modify pdf_generator_service.py to use the cell_boxes-first path, falling back to the existing HTML-based rendering when it fails:

def draw_table_region(self, ...):
    cell_boxes = table_element.get('cell_boxes', [])
    html_content = table_element.get('content', '')

    if cell_boxes and settings.table_rendering_prefer_cellboxes:
        # Try cell_boxes-first approach
        grid, row_heights, col_widths = infer_grid_from_cellboxes(cell_boxes)

        if grid:
            # Extract content from HTML
            contents = extract_cell_contents(html_content)

            # Map content to grid
            grid = map_content_to_grid(grid, contents, len(row_heights), len(col_widths))

            # Render using cell_boxes coordinates
            success = self._render_table_from_grid(
                pdf_canvas, grid, row_heights, col_widths,
                page_height, scale_w, scale_h
            )

            if success:
                return  # Done

    # Fallback to existing HTML-based rendering
    self._render_table_from_html(...)
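
`_render_table_from_grid` receives `page_height`, `scale_w` and `scale_h`, which suggests each cell bbox has to be converted from OCR image coordinates to PDF coordinates before drawing. The sketch below shows one way that conversion might look. It assumes the cell_boxes use a top-left origin, the PDF canvas uses a bottom-left origin (as ReportLab's canvas does by default), `page_height` is in PDF points, and the scale factors map OCR units to points. The helper name `cell_bbox_to_pdf_rect` is hypothetical.

```python
from typing import List, Tuple


def cell_bbox_to_pdf_rect(
    bbox: List[float], page_height: float, scale_w: float, scale_h: float
) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin [x0, y0, x1, y1] box into a
    bottom-left-origin (x, y, width, height) rectangle."""
    x0, y0, x1, y1 = bbox
    x = x0 * scale_w
    width = (x1 - x0) * scale_w
    height = (y1 - y0) * scale_h
    # Flip the Y axis: the rectangle's lower edge is the box's bottom edge (y1)
    y = page_height - y1 * scale_h
    return x, y, width, height


# A 100x50 cell at (10, 20) on a page 800 points tall, scaled by 0.5
print(cell_bbox_to_pdf_rect([10, 20, 110, 70], 800, 0.5, 0.5))
# (5.0, 765.0, 50.0, 25.0)
```

The resulting tuple matches the argument order of a typical `rect(x, y, width, height)` drawing call, so the border for each grid cell could be drawn directly from it.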

Configuration

# config.py
class Settings:
    # Table rendering strategy
    table_rendering_prefer_cellboxes: bool = Field(
        default=True,
        description="Use cell_boxes coordinates as primary table structure source"
    )

    table_cellboxes_row_threshold: float = Field(
        default=15.0,
        description="Y-coordinate threshold for row clustering"
    )

    table_cellboxes_col_threshold: float = Field(
        default=15.0,
        description="X-coordinate threshold for column clustering"
    )

Edge Cases

1. Empty cell_boxes

Condition: cell_boxes is empty or None
Action: Fall back to HTML-based rendering

2. Content Count Mismatch

Condition: HTML has more/fewer cells than cell_boxes grid
Action: Fill available cells, leave extras empty, log warning

3. Overlapping cell_boxes

Condition: Multiple cell_boxes map to same grid position
Action: Use first one, log warning

4. Single-cell Tables

Condition: Only 1 cell_box detected
Action: Render as single-cell table (valid case)

Testing Plan

Unit Tests

- test_infer_grid_from_cellboxes: Various cell_box configurations
- test_content_mapping: Content assignment scenarios

Integration Tests

- test_scan_pdf_table_7: Verify the problematic table renders correctly
- test_existing_tables: No regression on previously working tables

Visual Verification

- Compare PDF output before/after for scan.pdf
- Check table alignment and text placement
