feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:13:46 +08:00
parent 57070af307
commit cfe65158a3
58 changed files with 1271 additions and 3048 deletions
--- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md
+++ b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/design.md
@@ -0,0 +1,234 @@
+# Design: cell_boxes-First Table Rendering
+
+## Architecture Overview
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                    Table Rendering Pipeline                      │
+├─────────────────────────────────────────────────────────────────┤
+│                                                                  │
+│  Input: table_element                                            │
+│    ├── cell_boxes: [[x0,y0,x1,y1], ...]   (from PP-StructureV3)│
+│    ├── html: "<table>...</table>"          (from PP-StructureV3)│
+│    └── bbox: [x0, y0, x1, y1]              (table boundary)      │
+│                                                                  │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │            Step 1: Grid Inference from cell_boxes          │ │
+│  │                                                             │ │
+│  │  cell_boxes → cluster by Y → rows                          │ │
+│  │            → cluster by X → cols                           │ │
+│  │            → build grid[row][col] = cell_bbox              │ │
+│  └────────────────────────────────────────────────────────────┘ │
+│                          │                                       │
+│                          ▼                                       │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │            Step 2: Content Extraction from HTML            │ │
+│  │                                                             │ │
+│  │  html → parse → extract text list in reading order         │ │
+│  │       → flatten colspan/rowspan → [text1, text2, ...]      │ │
+│  └────────────────────────────────────────────────────────────┘ │
+│                          │                                       │
+│                          ▼                                       │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │            Step 3: Content-to-Cell Mapping                 │ │
+│  │                                                             │ │
+│  │  Option A: Sequential assignment (text[i] → cell[i])       │ │
+│  │  Option B: Coordinate matching (text_bbox ∩ cell_bbox)     │ │
+│  │  Option C: Row-by-row assignment                           │ │
+│  └────────────────────────────────────────────────────────────┘ │
+│                          │                                       │
+│                          ▼                                       │
+│  ┌────────────────────────────────────────────────────────────┐ │
+│  │            Step 4: PDF Rendering                           │ │
+│  │                                                             │ │
+│  │  For each cell in grid:                                    │ │
+│  │    1. Draw cell border at cell_bbox coordinates            │ │
+│  │    2. Render text content inside cell                      │ │
+│  └────────────────────────────────────────────────────────────┘ │
+│                                                                  │
+│  Output: Table rendered in PDF with accurate cell boundaries     │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## Detailed Design
+
+### 1. Grid Inference Algorithm
+
+```python
+from typing import List
+
+
+def infer_grid_from_cellboxes(cell_boxes: List[List[float]], threshold: float = 15.0):
+    """
+    Infer row/column grid structure from cell_boxes coordinates.
+
+    Args:
+        cell_boxes: List of [x0, y0, x1, y1] coordinates
+        threshold: Clustering threshold for row/column grouping
+
+    Returns:
+        grid: Dict[Tuple[int,int], Dict] mapping (row, col) to cell info
+        row_heights: List of row heights
+        col_widths: List of column widths
+    """
+    # 1. Extract all Y-centers and X-centers
+    y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes]
+    x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes]
+
+    # 2. Cluster Y-centers into rows
+    rows = cluster_values(y_centers, threshold)  # Sorted row cluster centers (one per row)
+
+    # 3. Cluster X-centers into columns
+    cols = cluster_values(x_centers, threshold)  # Sorted column cluster centers (one per column)
+
+    # 4. Assign each cell_box to (row, col)
+    grid = {}
+    for i, cb in enumerate(cell_boxes):
+        row = find_cluster(y_centers[i], rows)
+        col = find_cluster(x_centers[i], cols)
+        grid[(row, col)] = {
+            'bbox': cb,
+            'index': i
+        }
+
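+    # 5. Derive each row height / column width from the cells assigned to it.
+    #    Differences between cluster centers would yield one entry fewer than
+    #    the number of rows/columns; taking the smallest cell extent also keeps
+    #    a merged (spanning) cell from inflating its row or column.
+    row_heights = [
+        min(cell['bbox'][3] - cell['bbox'][1] for (r, _), cell in grid.items() if r == i)
+        for i in range(len(rows))
+    ]
+    col_widths = [
+        min(cell['bbox'][2] - cell['bbox'][0] for (_, c), cell in grid.items() if c == i)
+        for i in range(len(cols))
+    ]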
+
+    return grid, row_heights, col_widths
+```
+
+### 2. Content Extraction
+
+The HTML content extraction should handle colspan/rowspan by flattening:
+
+```python
+def extract_cell_contents(html: str) -> List[str]:
+    """
+    Extract cell text contents from HTML in reading order.
+    Expands colspan into repeated empty strings; rowspan is not expanded here.
+
+    Returns:
+        List of text strings, one per logical cell position
+    """
+    parser = HTMLTableParser()
+    parser.feed(html)
+
+    contents = []
+    for row in parser.tables[0]['rows']:
+        for cell in row['cells']:
+            contents.append(cell['text'])
+            # For colspan > 1, add empty strings for merged cells
+            for _ in range(cell.get('colspan', 1) - 1):
+                contents.append('')
+
+    return contents
+```
+
+### 3. Content-to-Cell Mapping Strategy
+
+**Recommended: Row-by-row Sequential Assignment**
+
+Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order:
+
+```python
+def map_content_to_grid(grid, contents, num_rows, num_cols):
+    """
+    Map extracted content to grid cells row by row.
+    """
+    content_idx = 0
+    for row in range(num_rows):
+        for col in range(num_cols):
+            if (row, col) in grid:
+                if content_idx < len(contents):
+                    grid[(row, col)]['content'] = contents[content_idx]
+                    content_idx += 1
+                else:
+                    grid[(row, col)]['content'] = ''
+
+    return grid
+```
+
+### 4. PDF Rendering Integration
+
+Modify `pdf_generator_service.py` to try the cell_boxes-first path and fall back to HTML-based rendering if it fails:
+
+```python
+def draw_table_region(self, ...):
+    cell_boxes = table_element.get('cell_boxes', [])
+    html_content = table_element.get('content', '')
+
+    if cell_boxes and settings.table_rendering_prefer_cellboxes:
+        # Try cell_boxes-first approach
+        grid, row_heights, col_widths = infer_grid_from_cellboxes(cell_boxes)
+
+        if grid:
+            # Extract content from HTML
+            contents = extract_cell_contents(html_content)
+
+            # Map content to grid
+            grid = map_content_to_grid(grid, contents, len(row_heights), len(col_widths))
+
+            # Render using cell_boxes coordinates
+            success = self._render_table_from_grid(
+                pdf_canvas, grid, row_heights, col_widths,
+                page_height, scale_w, scale_h
+            )
+
+            if success:
+                return  # Done
+
+    # Fallback to existing HTML-based rendering
+    self._render_table_from_html(...)
+```
+
+## Configuration
+
+```python
+# config.py
+class Settings:
+    # Table rendering strategy
+    table_rendering_prefer_cellboxes: bool = Field(
+        default=True,
+        description="Use cell_boxes coordinates as primary table structure source"
+    )
+
+    table_cellboxes_row_threshold: float = Field(
+        default=15.0,
+        description="Y-coordinate threshold for row clustering"
+    )
+
+    table_cellboxes_col_threshold: float = Field(
+        default=15.0,
+        description="X-coordinate threshold for column clustering"
+    )
+```
+
+## Edge Cases
+
+### 1. Empty cell_boxes
+- **Condition**: `cell_boxes` is empty or None
+- **Action**: Fall back to HTML-based rendering
+
+### 2. Content Count Mismatch
+- **Condition**: HTML has more/fewer cells than cell_boxes grid
+- **Action**: Fill available cells, leave extras empty, log warning
+
+### 3. Overlapping cell_boxes
+- **Condition**: Multiple cell_boxes map to same grid position
+- **Action**: Use first one, log warning
+
+### 4. Single-cell Tables
+- **Condition**: Only 1 cell_box detected
+- **Action**: Render as single-cell table (valid case)
+
+## Testing Plan
+
+1. **Unit Tests**
+   - `test_infer_grid_from_cellboxes`: Various cell_box configurations
+   - `test_content_mapping`: Content assignment scenarios
+
+2. **Integration Tests**
+   - `test_scan_pdf_table_7`: Verify the problematic table renders correctly
+   - `test_existing_tables`: No regression on previously working tables
+
+3. **Visual Verification**
+   - Compare PDF output before/after for `scan.pdf`
+   - Check table alignment and text placement
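
The grid inference sketch in `design.md` calls `cluster_values()` and `find_cluster()` without defining them. A minimal sketch of what they might look like follows. It assumes that `cluster_values` returns the sorted mean of each cluster and that `find_cluster` returns the index of the nearest cluster center. Both behaviors are assumptions made for illustration; the actual helpers in the codebase may differ.

```python
from typing import List


def cluster_values(values: List[float], threshold: float = 15.0) -> List[float]:
    """Group 1-D values whose gap to the previous sorted value is <= threshold.

    Returns the mean of each group in ascending order (assumed behavior).
    """
    if not values:
        return []
    ordered = sorted(values)
    groups = [[ordered[0]]]
    for value in ordered[1:]:
        if value - groups[-1][-1] <= threshold:
            groups[-1].append(value)   # close enough: same row/column
        else:
            groups.append([value])     # gap too large: start a new cluster
    return [sum(group) / len(group) for group in groups]


def find_cluster(value: float, centers: List[float]) -> int:
    """Return the index of the cluster center nearest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


# Y-centers of five cells that form two visual rows about 40 units apart.
y_centers = [10.0, 12.0, 11.0, 50.0, 52.0]
rows = cluster_values(y_centers)
row_of_each = [find_cluster(y, rows) for y in y_centers]
```

Because each value is compared only with its predecessor, a threshold larger than the true gap between rows will chain adjacent rows into one cluster, which is why the design exposes separate row and column thresholds in the settings.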
--- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md
+++ b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/proposal.md
@@ -0,0 +1,75 @@
+# Proposal: Use cell_boxes as Primary Table Rendering Source
+
+## Summary
+
+Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.
+
+## Problem Statement
+
+### Current Issue
+
+When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics:
+
+**Table 7 (Element 7)**:
+- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering)
+- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan
+
+This **grid mismatch** causes:
+1. `_compute_table_grid_from_cell_boxes()` returns `None, None`
+2. PDF generator falls back to ReportLab Table with equal column distribution
+3. Table renders with incorrect column widths, causing visual misalignment
+
+### Root Cause
+
+PP-StructureV3 sometimes merges multiple visual tables into one large table region:
+- The cell_boxes accurately detect individual cell boundaries
+- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes
+- Current logic requires exact grid match, which fails for complex merged tables
+
+## Proposed Solution
+
+### Strategy: cell_boxes-First Rendering
+
+Instead of requiring HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries:
+
+1. **Grid Inference from cell_boxes**
+   - Cluster cell_boxes by Y-coordinate to determine rows
+   - Cluster cell_boxes by X-coordinate to determine columns
+   - Build a row×col grid map from cell_boxes positions
+
+2. **Content Assignment from HTML**
+   - Extract text content from HTML in reading order
+   - Map text content to grid cells row by row, following the same reading order (the HTML carries no coordinates to match against)
+   - Handle cases where HTML has fewer/more cells than cell_boxes
+
+3. **Direct PDF Rendering**
+   - Render table borders using cell_boxes coordinates (already implemented)
+   - Place text content at calculated cell positions
+   - Skip ReportLab Table parsing when cell_boxes grid is valid
+
+### Key Changes
+
+| Component | Change |
+|-----------|--------|
+| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
+| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
+| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |
+
+## Benefits
+
+1. **Accurate Table Borders**: Borders are drawn at the detected cell_boxes coordinates instead of the equal-width columns produced by the HTML fallback
+2. **Handles Grid Mismatch**: Works even when HTML colspan/rowspan don't match cell count
+3. **Consistent Output**: Same rendering logic regardless of HTML complexity
+4. **Backward Compatible**: Existing HTML-based rendering remains as fallback
+
+## Non-Goals
+
+- Not modifying PP-StructureV3 detection logic
+- Not implementing table splitting (separate proposal if needed)
+- Not changing Direct track (PyMuPDF) table extraction
+
+## Success Criteria
+
+1. `scan.pdf` Table 7 renders with correct column widths based on cell_boxes
+2. All existing table tests continue to pass
+3. No regression for tables where HTML grid matches cell_boxes
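
The grid mismatch described in `proposal.md` can be illustrated by counting how many logical columns each HTML row spans and comparing that with the column count inferred from cell_boxes. The sketch below uses only the standard library. `RowWidthParser`, `html_row_widths`, and `html_matches_grid` are hypothetical names, and counting widths with colspan expanded (while ignoring rowspan) is an assumption about how the irregular row widths in the proposal were measured.

```python
from html.parser import HTMLParser
from typing import List


class RowWidthParser(HTMLParser):
    """Hypothetical helper: counts logical columns per <tr>, honoring colspan."""

    def __init__(self) -> None:
        super().__init__()
        self.row_widths: List[int] = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_widths.append(0)          # a new row starts at width 0
        elif tag in ("td", "th") and self.row_widths:
            colspan = dict(attrs).get("colspan") or "1"
            self.row_widths[-1] += int(colspan) if colspan.isdigit() else 1


def html_row_widths(html: str) -> List[int]:
    """Return the logical width of every row in the HTML table."""
    parser = RowWidthParser()
    parser.feed(html)
    return parser.row_widths


def html_matches_grid(html: str, num_cols: int) -> bool:
    """True only if every HTML row spans exactly num_cols columns."""
    widths = html_row_widths(html)
    return bool(widths) and all(width == num_cols for width in widths)


html = (
    "<table>"
    "<tr><td>a</td><td>b</td><td>c</td></tr>"
    "<tr><td colspan='3'>merged</td></tr>"
    "<tr><td>only one</td></tr>"
    "</table>"
)
widths = html_row_widths(html)
```

A table whose rows report different widths, like the last row here, is the kind of input for which an exact-match requirement fails and the cell_boxes-first path is intended to take over.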
--- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md
+++ b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/specs/document-processing/spec.md
@@ -0,0 +1,36 @@
+# document-processing Specification Delta
+
+## MODIFIED Requirements
+
+### Requirement: Extract table structure (Modified)
+
+The system SHALL use cell_boxes coordinates as the primary source for table structure when rendering PDFs, with HTML parsing as fallback.
+
+#### Scenario: Render table using cell_boxes grid
+- **WHEN** rendering a table element to PDF
+- **AND** the table has valid cell_boxes coordinates
+- **AND** `table_rendering_prefer_cellboxes` is enabled
+- **THEN** the system SHALL infer row/column grid from cell_boxes coordinates
+- **AND** extract text content from HTML in reading order
+- **AND** map content to grid cells by position
+- **AND** render table borders using cell_boxes coordinates
+- **AND** place text content within calculated cell boundaries
+
+#### Scenario: Handle cell_boxes grid mismatch gracefully
+- **WHEN** cell_boxes grid has different dimensions than HTML colspan/rowspan structure
+- **THEN** the system SHALL use cell_boxes grid as authoritative structure
+- **AND** map available HTML content to cells row-by-row
+- **AND** leave unmapped cells empty
+- **AND** log warning if content count differs significantly
+
+#### Scenario: Fallback to HTML-based rendering
+- **WHEN** cell_boxes is empty or None
+- **OR** `table_rendering_prefer_cellboxes` is disabled
+- **OR** cell_boxes grid inference fails
+- **THEN** the system SHALL fall back to existing HTML-based table rendering
+- **AND** use ReportLab Table with parsed HTML structure
+
+#### Scenario: Maintain backward compatibility
+- **WHEN** processing tables where cell_boxes grid matches HTML structure
+- **THEN** the system SHALL produce identical output to previous behavior
+- **AND** pass all existing table rendering tests
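
The fallback and mismatch scenarios in `spec.md` amount to a small decision procedure. A minimal sketch follows. `select_render_path` and `warn_on_count_mismatch` are hypothetical names, and treating any difference in count as worth a warning is an assumption, since the spec only says "differs significantly" without giving a threshold.

```python
import logging
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

Grid = Dict[Tuple[int, int], dict]


def select_render_path(
    cell_boxes: Optional[List[List[float]]],
    prefer_cellboxes: bool,
    grid: Optional[Grid],
) -> str:
    """Return "cellboxes" or "html", following the fallback scenario."""
    if not cell_boxes:        # cell_boxes is empty or None
        return "html"
    if not prefer_cellboxes:  # table_rendering_prefer_cellboxes is disabled
        return "html"
    if not grid:              # grid inference failed or produced nothing
        return "html"
    return "cellboxes"


def warn_on_count_mismatch(grid: Grid, contents: List[str]) -> bool:
    """Log a warning when the HTML content count and grid cell count differ."""
    mismatch = len(grid) != len(contents)
    if mismatch:
        logger.warning(
            "cell_boxes grid has %d cells but HTML yielded %d contents",
            len(grid),
            len(contents),
        )
    return mismatch
```

Keeping the decision in a pure function like this would make each scenario directly testable without constructing a PDF canvas.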
--- a/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md
+++ b/openspec/changes/archive/2025-12-11-use-cellboxes-for-table-rendering/tasks.md
@@ -0,0 +1,48 @@
+## 1. Core Algorithm Implementation
+
+### 1.1 Grid Inference Module
+- [x] 1.1.1 Create `CellBoxGridInferrer` class in `pdf_table_renderer.py`
+- [x] 1.1.2 Implement `cluster_values()` for Y/X coordinate clustering
+- [x] 1.1.3 Implement `infer_grid_from_cellboxes()` main method
+- [x] 1.1.4 Add row_heights and col_widths calculation
+
+### 1.2 Content Mapping
+- [x] 1.2.1 Implement `extract_cell_contents()` from HTML
+- [x] 1.2.2 Implement `map_content_to_grid()` for row-by-row assignment
+- [x] 1.2.3 Handle content count mismatch (more/fewer cells)
+
+## 2. PDF Generator Integration
+
+### 2.1 New Rendering Path
+- [x] 2.1.1 Add `render_from_cellboxes_grid()` method to TableRenderer
+- [x] 2.1.2 Integrate into `draw_table_region()` with cellboxes-first check
+- [x] 2.1.3 Maintain fallback to existing HTML-based rendering
+
+### 2.2 Cell Rendering
+- [x] 2.2.1 Draw cell borders using cell_boxes coordinates
+- [x] 2.2.2 Render text content with proper alignment and padding
+- [x] 2.2.3 Handle multi-line text within cells
+
+## 3. Configuration
+
+### 3.1 Settings
+- [x] 3.1.1 Add `table_rendering_prefer_cellboxes: bool = True`
+- [x] 3.1.2 Add `table_cellboxes_row_threshold: float = 15.0`
+- [x] 3.1.3 Add `table_cellboxes_col_threshold: float = 15.0`
+
+## 4. Testing
+
+### 4.1 Unit Tests
+- [x] 4.1.1 Test grid inference with various cell_box configurations
+- [x] 4.1.2 Test content mapping edge cases
+- [x] 4.1.3 Test coordinate clustering accuracy
+
+### 4.2 Integration Tests
+- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
+- [ ] 4.2.2 Verify no regression on existing table tests
+- [ ] 4.2.3 Visual comparison of output PDFs
+
+## 5. Documentation
+
+- [x] 5.1 Update inline code comments
+- [x] 5.2 Update spec with new table rendering requirement
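
Task 4.1.2 ("Test content mapping edge cases") could be covered by tests along the following lines. This is a sketch rather than the project's actual test suite: `map_content_to_grid` is reproduced from `design.md` so the block is self-contained, and `make_grid` plus the test names are hypothetical.

```python
def map_content_to_grid(grid, contents, num_rows, num_cols):
    """Row-by-row assignment, as described in design.md."""
    content_idx = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) in grid:
                if content_idx < len(contents):
                    grid[(row, col)]["content"] = contents[content_idx]
                    content_idx += 1
                else:
                    grid[(row, col)]["content"] = ""
    return grid


def make_grid(positions):
    """Build a grid with placeholder bboxes at the given (row, col) positions."""
    return {pos: {"bbox": [0, 0, 1, 1], "index": i} for i, pos in enumerate(positions)}


def test_fewer_contents_than_cells():
    cells = [(0, 0), (0, 1), (1, 0)]
    grid = map_content_to_grid(make_grid(cells), ["a", "b"], 2, 2)
    assert [grid[pos]["content"] for pos in cells] == ["a", "b", ""]


def test_more_contents_than_cells():
    # The surplus text is dropped; the single cell keeps the first text.
    grid = map_content_to_grid(make_grid([(0, 0)]), ["a", "extra"], 1, 1)
    assert grid[(0, 0)]["content"] == "a"


def test_sparse_grid_skips_missing_positions():
    # No cell at (0, 1): the second text goes to the next existing cell, (1, 0).
    grid = map_content_to_grid(make_grid([(0, 0), (1, 0)]), ["a", "b"], 2, 2)
    assert grid[(1, 0)]["content"] == "b"
```

The functions follow the `test_*` naming convention, so a runner such as pytest would collect them as written.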