feat: enable document orientation detection for scanned PDFs
- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,234 @@
# Design: cell_boxes-First Table Rendering

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                    Table Rendering Pipeline                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input: table_element                                           │
│    ├── cell_boxes: [[x0,y0,x1,y1], ...]   (from PP-StructureV3) │
│    ├── html: "<table>...</table>"         (from PP-StructureV3) │
│    └── bbox: [x0, y0, x1, y1]             (table boundary)      │
│                                                                 │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Step 1: Grid Inference from cell_boxes                     │ │
│  │                                                            │ │
│  │   cell_boxes → cluster by Y → rows                         │ │
│  │              → cluster by X → cols                         │ │
│  │              → build grid[row][col] = cell_bbox            │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Step 2: Content Extraction from HTML                       │ │
│  │                                                            │ │
│  │   html → parse → extract text list in reading order        │ │
│  │        → flatten colspan/rowspan → [text1, text2, ...]     │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Step 3: Content-to-Cell Mapping                            │ │
│  │                                                            │ │
│  │   Option A: Sequential assignment (text[i] → cell[i])      │ │
│  │   Option B: Coordinate matching (text_bbox ∩ cell_bbox)    │ │
│  │   Option C: Row-by-row assignment                          │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                │                                │
│                                ▼                                │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │ Step 4: PDF Rendering                                      │ │
│  │                                                            │ │
│  │   For each cell in grid:                                   │ │
│  │     1. Draw cell border at cell_bbox coordinates           │ │
│  │     2. Render text content inside cell                     │ │
│  └────────────────────────────────────────────────────────────┘ │
│                                                                 │
│  Output: Table rendered in PDF with accurate cell boundaries    │
└─────────────────────────────────────────────────────────────────┘
```

## Detailed Design

### 1. Grid Inference Algorithm

```python
from typing import Dict, List, Tuple


def infer_grid_from_cellboxes(
    cell_boxes: List[List[float]], threshold: float = 15.0
) -> Tuple[Dict[Tuple[int, int], Dict], List[float], List[float]]:
    """
    Infer row/column grid structure from cell_boxes coordinates.

    Args:
        cell_boxes: List of [x0, y0, x1, y1] coordinates
        threshold: Clustering threshold for row/column grouping

    Returns:
        grid: Dict[Tuple[int,int], Dict] mapping (row, col) to cell info
        row_heights: List of row heights
        col_widths: List of column widths
    """
    # 1. Extract all Y-centers and X-centers
    y_centers = [(cb[1] + cb[3]) / 2 for cb in cell_boxes]
    x_centers = [(cb[0] + cb[2]) / 2 for cb in cell_boxes]

    # 2. Cluster Y-centers into rows (sorted cluster centers, one per row)
    rows = cluster_values(y_centers, threshold)

    # 3. Cluster X-centers into columns (sorted cluster centers, one per column)
    cols = cluster_values(x_centers, threshold)

    # 4. Assign each cell_box to (row, col); the first box at a position wins
    grid = {}
    for i, cb in enumerate(cell_boxes):
        row = find_cluster(y_centers[i], rows)
        col = find_cluster(x_centers[i], cols)
        grid.setdefault((row, col), {'bbox': cb, 'index': i})

    # 5. Calculate heights/widths from cell boundaries, one entry per row/column
    #    (differences between cluster centers would yield one entry too few)
    row_heights = [0.0] * len(rows)
    col_widths = [0.0] * len(cols)
    for (row, col), cell in grid.items():
        x0, y0, x1, y1 = cell['bbox']
        row_heights[row] = max(row_heights[row], y1 - y0)
        col_widths[col] = max(col_widths[col], x1 - x0)

    return grid, row_heights, col_widths
```

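`infer_grid_from_cellboxes` calls `cluster_values` and `find_cluster`, which this design does not define. The following is a minimal sketch of what they might look like, assuming `cluster_values` returns one sorted center per cluster and `find_cluster` returns the index of the nearest center. Both bodies are illustrative assumptions, not the project's implementation.

```python
from typing import List


def cluster_values(values: List[float], threshold: float = 15.0) -> List[float]:
    """Group 1-D values into clusters; return one center per cluster, sorted."""
    clusters: List[List[float]] = []
    for value in sorted(values):
        # Extend the current cluster while the gap to its last value is small
        if clusters and value - clusters[-1][-1] <= threshold:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return [sum(c) / len(c) for c in clusters]


def find_cluster(value: float, centers: List[float]) -> int:
    """Return the index of the cluster center nearest to value."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


y_centers = [10.0, 12.0, 50.0, 52.0, 90.0]
rows = cluster_values(y_centers)
print(rows)                      # [11.0, 51.0, 90.0]
print(find_cluster(49.0, rows))  # 1
```

Because this sketch chains neighbouring values by gap, a long run of closely spaced centers can merge into one cluster; a threshold near the smallest expected row height keeps that in check.
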
### 2. Content Extraction

HTML content extraction flattens merged cells into one text entry per logical cell position. The code below expands colspan only; rowspan expansion is not shown:

```python
from typing import List


def extract_cell_contents(html: str) -> List[str]:
    """
    Extract cell text contents from HTML in reading order.

    Expands colspan into trailing empty strings. rowspan is not expanded here.

    Returns:
        List of text strings, one per logical cell position
    """
    parser = HTMLTableParser()
    parser.feed(html)

    # No table found: nothing to map
    if not parser.tables:
        return []

    contents = []
    for row in parser.tables[0]['rows']:
        for cell in row['cells']:
            contents.append(cell['text'])
            # For colspan > 1, add empty strings for merged cells
            for _ in range(cell.get('colspan', 1) - 1):
                contents.append('')

    return contents
```

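`extract_cell_contents` relies on an `HTMLTableParser` that exposes `tables[0]['rows'][i]['cells'][j]` with `text` and `colspan` keys. As a rough illustration of that shape, here is a hypothetical minimal parser built on the standard library's `html.parser.HTMLParser`; the project's real class may differ, and nested tables are not handled.

```python
from html.parser import HTMLParser


class HTMLTableParser(HTMLParser):
    """Hypothetical minimal parser; the project's real class may differ."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._cell = None  # cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables.append({'rows': []})
        elif tag == 'tr' and self.tables:
            self.tables[-1]['rows'].append({'cells': []})
        elif tag in ('td', 'th') and self.tables and self.tables[-1]['rows']:
            attr_map = dict(attrs)
            self._cell = {
                'text': '',
                'colspan': int(attr_map.get('colspan') or 1),
                'rowspan': int(attr_map.get('rowspan') or 1),
            }
            self.tables[-1]['rows'][-1]['cells'].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            if self._cell is not None:
                self._cell['text'] = self._cell['text'].strip()
            self._cell = None

    def handle_data(self, data):
        # Accumulate text only while inside a cell
        if self._cell is not None:
            self._cell['text'] += data


parser = HTMLTableParser()
parser.feed('<table><tr><td colspan="2">Total</td><td>5</td></tr></table>')
cells = parser.tables[0]['rows'][0]['cells']
print([(c['text'], c['colspan']) for c in cells])  # [('Total', 2), ('5', 1)]
```
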
### 3. Content-to-Cell Mapping Strategy

**Recommended: Row-by-row Sequential Assignment**

Since HTML content is in reading order (top-to-bottom, left-to-right), map content to grid cells in the same order:

```python
def map_content_to_grid(grid, contents, num_rows, num_cols):
    """
    Map extracted content to grid cells row by row.
    """
    content_idx = 0
    for row in range(num_rows):
        for col in range(num_cols):
            if (row, col) in grid:
                if content_idx < len(contents):
                    grid[(row, col)]['content'] = contents[content_idx]
                    content_idx += 1
                else:
                    grid[(row, col)]['content'] = ''

    return grid
```

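The row-by-row assignment can also report a content count mismatch, which this design handles by logging a warning. The variant below is a sketch under that assumption; the log message wording and the use of the standard `logging` module are illustrative choices, not the project's code.

```python
import logging

logger = logging.getLogger(__name__)


def map_content_to_grid(grid, contents, num_rows, num_cols):
    """Row-by-row assignment that also reports a content count mismatch."""
    # Grid positions that actually hold a cell, in reading order
    ordered = [(r, c) for r in range(num_rows) for c in range(num_cols) if (r, c) in grid]
    if len(ordered) != len(contents):
        logger.warning(
            "cell_boxes grid has %d cells but HTML yielded %d texts",
            len(ordered), len(contents),
        )
    for idx, key in enumerate(ordered):
        # Cells beyond the available content stay empty
        grid[key]['content'] = contents[idx] if idx < len(contents) else ''
    return grid


grid = {(0, 0): {}, (0, 1): {}, (1, 0): {}}
mapped = map_content_to_grid(grid, ['A', 'B'], num_rows=2, num_cols=2)
print([mapped[k]['content'] for k in sorted(mapped)])  # ['A', 'B', '']
```
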
### 4. PDF Rendering Integration

Modify `pdf_generator_service.py` to try the cell_boxes-first path before falling back to HTML-based rendering:

```python
def draw_table_region(self, ...):
    cell_boxes = table_element.get('cell_boxes', [])
    html_content = table_element.get('content', '')

    if cell_boxes and settings.table_rendering_prefer_cellboxes:
        # Try cell_boxes-first approach
        grid, row_heights, col_widths = infer_grid_from_cellboxes(cell_boxes)

        if grid:
            # Extract content from HTML
            contents = extract_cell_contents(html_content)

            # Map content to grid
            grid = map_content_to_grid(grid, contents, len(row_heights), len(col_widths))

            # Render using cell_boxes coordinates
            success = self._render_table_from_grid(
                pdf_canvas, grid, row_heights, col_widths,
                page_height, scale_w, scale_h
            )

            if success:
                return  # Done

    # Fallback to existing HTML-based rendering
    self._render_table_from_html(...)
```

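`_render_table_from_grid` receives `page_height`, `scale_w` and `scale_h`, which suggests each cell box must be converted from image coordinates to PDF coordinates before drawing. A minimal sketch of that conversion follows, assuming cell boxes use a top-left origin and the PDF canvas a bottom-left origin. The helper name `cell_bbox_to_pdf_rect` is hypothetical.

```python
from typing import List, Tuple


def cell_bbox_to_pdf_rect(
    bbox: List[float], page_height: float, scale_w: float = 1.0, scale_h: float = 1.0
) -> Tuple[float, float, float, float]:
    """Convert an image-space [x0, y0, x1, y1] box to a PDF (x, y, width, height) rect."""
    x0, y0, x1, y1 = bbox
    x = x0 * scale_w
    width = (x1 - x0) * scale_w
    height = (y1 - y0) * scale_h
    # PDF origin is bottom-left, so y is the distance from the page bottom
    # to the box's lower edge (y1 in image space)
    y = page_height - y1 * scale_h
    return x, y, width, height


print(cell_bbox_to_pdf_rect([100, 50, 300, 90], page_height=800))
# (100.0, 710.0, 200.0, 40.0)
```

The returned tuple matches the lower-left-corner convention of ReportLab's `canvas.rect(x, y, width, height)`, so it could be passed through directly when drawing cell borders.
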
## Configuration

```python
# config.py
class Settings:
    # Table rendering strategy
    table_rendering_prefer_cellboxes: bool = Field(
        default=True,
        description="Use cell_boxes coordinates as primary table structure source"
    )

    table_cellboxes_row_threshold: float = Field(
        default=15.0,
        description="Y-coordinate threshold for row clustering"
    )

    table_cellboxes_col_threshold: float = Field(
        default=15.0,
        description="X-coordinate threshold for column clustering"
    )
```

## Edge Cases

### 1. Empty cell_boxes
- **Condition**: `cell_boxes` is empty or None
- **Action**: Fall back to HTML-based rendering

### 2. Content Count Mismatch
- **Condition**: HTML has more/fewer cells than cell_boxes grid
- **Action**: Fill available cells, leave extras empty, log warning

### 3. Overlapping cell_boxes
- **Condition**: Multiple cell_boxes map to same grid position
- **Action**: Use first one, log warning

### 4. Single-cell Tables
- **Condition**: Only 1 cell_box detected
- **Action**: Render as single-cell table (valid case)

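The first and last edge cases come down to a guard before the cell_boxes-first path is attempted. One way to express it is sketched below; the function name `should_use_cellboxes` is hypothetical, and the check for four coordinates and positive area is an added assumption rather than something this design specifies.

```python
from typing import List, Optional


def should_use_cellboxes(
    cell_boxes: Optional[List[List[float]]], prefer_cellboxes: bool = True
) -> bool:
    """Return True when the cell_boxes-first path is worth attempting."""
    if not prefer_cellboxes or not cell_boxes:
        return False  # disabled, None, or empty: fall back to HTML rendering
    # Assumption: every box needs four coordinates and positive area
    return all(len(cb) == 4 and cb[2] > cb[0] and cb[3] > cb[1] for cb in cell_boxes)


print(should_use_cellboxes(None))                # False
print(should_use_cellboxes([[0, 0, 10, 10]]))    # True (single-cell table is valid)
```
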
## Testing Plan

1. **Unit Tests**
   - `test_infer_grid_from_cellboxes`: Various cell_box configurations
   - `test_content_mapping`: Content assignment scenarios

2. **Integration Tests**
   - `test_scan_pdf_table_7`: Verify the problematic table renders correctly
   - `test_existing_tables`: No regression on previously working tables

3. **Visual Verification**
   - Compare PDF output before/after for `scan.pdf`
   - Check table alignment and text placement
@@ -0,0 +1,75 @@
# Proposal: Use cell_boxes as Primary Table Rendering Source

## Summary

Modify table PDF rendering to use `cell_boxes` coordinates as the primary source for table structure instead of relying on HTML table parsing. This resolves grid mismatch issues where PP-StructureV3's HTML structure (with colspan/rowspan) doesn't match the cell_boxes coordinate grid.

## Problem Statement

### Current Issue

When processing `scan.pdf`, PP-StructureV3 detected tables with the following characteristics:

**Table 7 (Element 7)**:
- `cell_boxes`: 27 cells forming an 11x10 grid (by coordinate clustering)
- HTML structure: 9 rows with irregular columns `[7, 7, 1, 3, 3, 3, 3, 3, 1]` due to colspan

This **grid mismatch** causes:
1. `_compute_table_grid_from_cell_boxes()` returns `None, None`
2. PDF generator falls back to ReportLab Table with equal column distribution
3. Table renders with incorrect column widths, causing visual misalignment

### Root Cause

PP-StructureV3 sometimes merges multiple visual tables into one large table region:
- The cell_boxes accurately detect individual cell boundaries
- The HTML uses colspan to represent merged cells, but the grid doesn't match cell_boxes
- Current logic requires exact grid match, which fails for complex merged tables

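To make the Table 7 mismatch concrete, the snippet below applies a strict grid comparison to the numbers quoted above. `grids_match` is a hypothetical stand-in for an exact-match requirement, not the actual body of `_compute_table_grid_from_cell_boxes()`.

```python
from typing import List, Tuple


def grids_match(html_cols_per_row: List[int], cellbox_grid: Tuple[int, int]) -> bool:
    """Strict check: same row count, and every HTML row has the grid's column count."""
    rows, cols = cellbox_grid
    return len(html_cols_per_row) == rows and all(n == cols for n in html_cols_per_row)


html_cols_per_row = [7, 7, 1, 3, 3, 3, 3, 3, 1]  # Table 7 HTML, per the list above
cellbox_grid = (11, 10)                          # rows x cols from coordinate clustering

print(grids_match(html_cols_per_row, cellbox_grid))  # False
```

Any rule of this shape rejects Table 7 outright, because the HTML has 9 rows against 11 clustered rows and no HTML row has 10 columns.
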
## Proposed Solution

### Strategy: cell_boxes-First Rendering

Instead of requiring the HTML grid to match cell_boxes, **use cell_boxes directly** as the authoritative source for cell boundaries:

1. **Grid Inference from cell_boxes**
   - Cluster cell_boxes by Y-coordinate to determine rows
   - Cluster cell_boxes by X-coordinate to determine columns
   - Build a row×col grid map from cell_boxes positions

2. **Content Assignment from HTML**
   - Extract text content from HTML in reading order
   - Map text content to cell_boxes grid positions row by row, in the same reading order
   - Handle cases where HTML has fewer/more cells than cell_boxes

3. **Direct PDF Rendering**
   - Render table borders using cell_boxes coordinates (already implemented)
   - Place text content at calculated cell positions
   - Skip ReportLab Table parsing when cell_boxes grid is valid

### Key Changes

| Component | Change |
|-----------|--------|
| `pdf_generator_service.py` | Add cell_boxes-first rendering path |
| `table_content_rebuilder.py` | Enhance to support grid-based content mapping |
| `config.py` | Add `table_rendering_prefer_cellboxes: bool` setting |

## Benefits

1. **Accurate Table Borders**: cell_boxes from ML detection are more precise than HTML parsing
2. **Handles Grid Mismatch**: Works even when HTML colspan/rowspan don't match cell count
3. **Consistent Output**: Same rendering logic regardless of HTML complexity
4. **Backward Compatible**: Existing HTML-based rendering remains as fallback

## Non-Goals

- Not modifying PP-StructureV3 detection logic
- Not implementing table splitting (separate proposal if needed)
- Not changing Direct track (PyMuPDF) table extraction

## Success Criteria

1. `scan.pdf` Table 7 renders with correct column widths based on cell_boxes
2. All existing table tests continue to pass
3. No regression for tables where HTML grid matches cell_boxes
@@ -0,0 +1,36 @@
# document-processing Specification Delta

## MODIFIED Requirements

### Requirement: Extract table structure (Modified)

The system SHALL use cell_boxes coordinates as the primary source for table structure when rendering PDFs, with HTML parsing as fallback.

#### Scenario: Render table using cell_boxes grid
- **WHEN** rendering a table element to PDF
- **AND** the table has valid cell_boxes coordinates
- **AND** `table_rendering_prefer_cellboxes` is enabled
- **THEN** the system SHALL infer row/column grid from cell_boxes coordinates
- **AND** extract text content from HTML in reading order
- **AND** map content to grid cells by position
- **AND** render table borders using cell_boxes coordinates
- **AND** place text content within calculated cell boundaries

#### Scenario: Handle cell_boxes grid mismatch gracefully
- **WHEN** cell_boxes grid has different dimensions than HTML colspan/rowspan structure
- **THEN** the system SHALL use cell_boxes grid as authoritative structure
- **AND** map available HTML content to cells row-by-row
- **AND** leave unmapped cells empty
- **AND** log warning if content count differs significantly

#### Scenario: Fallback to HTML-based rendering
- **WHEN** cell_boxes is empty or None
- **OR** `table_rendering_prefer_cellboxes` is disabled
- **OR** cell_boxes grid inference fails
- **THEN** the system SHALL fall back to existing HTML-based table rendering
- **AND** use ReportLab Table with parsed HTML structure

#### Scenario: Maintain backward compatibility
- **WHEN** processing tables where cell_boxes grid matches HTML structure
- **THEN** the system SHALL produce identical output to previous behavior
- **AND** pass all existing table rendering tests
@@ -0,0 +1,48 @@
## 1. Core Algorithm Implementation

### 1.1 Grid Inference Module
- [x] 1.1.1 Create `CellBoxGridInferrer` class in `pdf_table_renderer.py`
- [x] 1.1.2 Implement `cluster_values()` for Y/X coordinate clustering
- [x] 1.1.3 Implement `infer_grid_from_cellboxes()` main method
- [x] 1.1.4 Add row_heights and col_widths calculation

### 1.2 Content Mapping
- [x] 1.2.1 Implement `extract_cell_contents()` from HTML
- [x] 1.2.2 Implement `map_content_to_grid()` for row-by-row assignment
- [x] 1.2.3 Handle content count mismatch (more/fewer cells)

## 2. PDF Generator Integration

### 2.1 New Rendering Path
- [x] 2.1.1 Add `render_from_cellboxes_grid()` method to TableRenderer
- [x] 2.1.2 Integrate into `draw_table_region()` with cellboxes-first check
- [x] 2.1.3 Maintain fallback to existing HTML-based rendering

### 2.2 Cell Rendering
- [x] 2.2.1 Draw cell borders using cell_boxes coordinates
- [x] 2.2.2 Render text content with proper alignment and padding
- [x] 2.2.3 Handle multi-line text within cells

## 3. Configuration

### 3.1 Settings
- [x] 3.1.1 Add `table_rendering_prefer_cellboxes: bool = True`
- [x] 3.1.2 Add `table_cellboxes_row_threshold: float = 15.0`
- [x] 3.1.3 Add `table_cellboxes_col_threshold: float = 15.0`

## 4. Testing

### 4.1 Unit Tests
- [x] 4.1.1 Test grid inference with various cell_box configurations
- [x] 4.1.2 Test content mapping edge cases
- [x] 4.1.3 Test coordinate clustering accuracy

### 4.2 Integration Tests
- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
- [ ] 4.2.2 Verify no regression on existing table tests
- [ ] 4.2.3 Visual comparison of output PDFs

## 5. Documentation

- [x] 5.1 Update inline code comments
- [x] 5.2 Update spec with new table rendering requirement