feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
egg
2025-12-11 17:13:46 +08:00
parent 57070af307
commit cfe65158a3
58 changed files with 1271 additions and 3048 deletions


@@ -0,0 +1,141 @@
# Design: Simple Text Positioning
## Architecture
### Current Flow (Complex)
```
Raw OCR → PP-Structure Analysis → Table Detection → HTML Parsing →
Column Correction → Cell Positioning → PDF Generation
```
### New Flow (Simple)
```
Raw OCR → Text Region Extraction → Bbox Processing →
Rotation Calculation → Font Size Estimation → PDF Text Rendering
```
## Core Components
### 1. TextRegionRenderer
New service class to handle raw OCR text rendering:
```python
from typing import Dict

from reportlab.pdfgen.canvas import Canvas


class TextRegionRenderer:
    """Render raw OCR text regions to PDF."""

    def render_text_region(
        self,
        canvas: Canvas,
        region: Dict,
        scale_factor: float
    ) -> None:
        """
        Render a single OCR text region.

        Args:
            canvas: ReportLab canvas
            region: Raw OCR region with text and bbox
            scale_factor: Coordinate scaling factor
        """
```
### 2. Bbox Processing
Raw OCR bbox format (quadrilateral - 4 corner points):
```json
{
"text": "LOCTITE",
"bbox": [[116, 76], [378, 76], [378, 128], [116, 128]],
"confidence": 0.98
}
```
Processing steps:
1. **Center point**: Average of 4 corners
2. **Width/Height**: Lengths of the edges between adjacent corners (top/bottom edges for width, left/right edges for height)
3. **Rotation angle**: Angle of top edge from horizontal
4. **Font size**: Approximate from bbox height
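
The first two steps can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name `bbox_geometry` is hypothetical, and the corner order (top-left, top-right, bottom-right, bottom-left) is assumed from the sample region above.

```python
import math
from typing import List, Tuple

Point = List[float]


def bbox_geometry(bbox: List[Point]) -> Tuple[Tuple[float, float], float, float]:
    """Return (center, width, height) of a 4-point quadrilateral.

    Hypothetical helper. Assumes corners are ordered top-left, top-right,
    bottom-right, bottom-left.
    """
    # Center point: average of the four corners
    cx = sum(p[0] for p in bbox) / 4
    cy = sum(p[1] for p in bbox) / 4
    # Width: mean length of the top and bottom edges
    width = (math.dist(bbox[0], bbox[1]) + math.dist(bbox[3], bbox[2])) / 2
    # Height: mean length of the left and right edges
    height = (math.dist(bbox[0], bbox[3]) + math.dist(bbox[1], bbox[2])) / 2
    return (cx, cy), width, height


# The "LOCTITE" sample region from above
center, width, height = bbox_geometry([[116, 76], [378, 76], [378, 128], [116, 128]])
# center == (247.0, 102.0), width == 262.0, height == 52.0
```

Averaging opposite edges keeps the result stable when the quadrilateral is slightly skewed rather than a perfect rectangle.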
### 3. Rotation Calculation
```python
import math
from typing import List


def calculate_rotation(bbox: List[List[float]]) -> float:
    """
    Calculate text rotation from bbox quadrilateral.

    Returns the angle of the top edge in degrees, measured in OCR image
    coordinates (y axis pointing down). A positive value therefore appears
    clockwise on screen; negate it for a y-up system such as PDF.
    """
    # Top-left to top-right vector
    dx = bbox[1][0] - bbox[0][0]
    dy = bbox[1][1] - bbox[0][1]
    # Angle in degrees
    return math.degrees(math.atan2(dy, dx))
```
### 4. Font Size Estimation
```python
import math
from typing import List


def estimate_font_size(bbox: List[List[float]], text: str) -> float:
    """
    Estimate font size from bbox dimensions.

    Uses bbox height as the indicator. The aspect-ratio adjustment is not
    implemented in this sketch, so `text` is currently unused.
    """
    # Calculate bbox height (average of left and right edges)
    left_height = math.dist(bbox[0], bbox[3])
    right_height = math.dist(bbox[1], bbox[2])
    avg_height = (left_height + right_height) / 2
    # Font size is approximately 70-80% of bbox height; use the midpoint
    return avg_height * 0.75
```
## Integration Points
### PDFGeneratorService
Modify `draw_ocr_content()` to use simple text positioning:
```python
def draw_ocr_content(self, canvas, content_data, page_info):
    """Draw OCR content using simple text positioning."""
    # Use raw OCR regions directly
    raw_regions = content_data.get('raw_ocr_regions', [])
    # Assumption: the scale factor is carried in page_info; the key name is illustrative
    scale_factor = page_info.get('scale_factor', 1.0)
    for region in raw_regions:
        self.text_renderer.render_text_region(
            canvas, region, scale_factor
        )
```
### Configuration
Add config option to enable/disable simple mode:
```python
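from pydantic import BaseModel, Field


# Assumes a pydantic model; the project's actual settings base class may differ
class OCRSettings(BaseModel):
    simple_text_positioning: bool = Field(
        default=True,
        description="Use simple text positioning instead of table reconstruction"
    )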
```
## File Changes
| File | Change |
|------|--------|
| `app/services/text_region_renderer.py` | New - Text rendering logic |
| `app/services/pdf_generator_service.py` | Modify - Integration |
| `app/core/config.py` | Add - Configuration option |
## Edge Cases
1. **Overlapping text**: Regions may overlap slightly - render in reading order
2. **Very small text**: Minimum font size threshold (6pt)
3. **Rotated pages**: Handle 90/180/270 degree page rotation
4. **Empty regions**: Skip regions with empty text
5. **Unicode text**: Ensure font supports CJK characters
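
Edge cases 2 to 4 can be handled with small guards, sketched below. All three function names are hypothetical, and treating the 6pt threshold as a clamp (rather than a reason to skip the region) is an assumption. The dimension swap for 90°/270° follows the behaviour described in the commit message.

```python
from typing import Dict, Tuple

MIN_FONT_SIZE = 6.0  # pt, threshold from edge case 2


def clamp_font_size(estimated: float, minimum: float = MIN_FONT_SIZE) -> float:
    """Raise very small estimates to the minimum (assumed clamp semantics)."""
    return max(estimated, minimum)


def has_renderable_text(region: Dict) -> bool:
    """False for regions whose text is missing, empty, or whitespace only."""
    return bool(str(region.get("text", "")).strip())


def rotated_page_size(width: float, height: float, angle: int) -> Tuple[float, float]:
    """Swap width and height for 90/270 degree page rotations."""
    if angle % 180 == 90:
        return height, width
    return width, height
```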


@@ -0,0 +1,42 @@
# Simple Text Positioning from Raw OCR
## Summary
Simplify OCR track PDF generation by rendering raw OCR text at correct positions without complex table structure reconstruction.
## Problem
Current OCR track processing has multiple failure points:
1. PP-Structure table structure recognition fails for borderless tables
2. Multi-column layouts get merged incorrectly into single tables
3. Table HTML reconstruction produces wrong cell positions
4. Complex column correction algorithms still can't fix fundamental structure errors
Meanwhile, raw OCR (`raw_ocr_regions.json`) correctly identifies all text with accurate bounding boxes.
## Solution
Replace complex table reconstruction with simple text positioning:
1. Read raw OCR regions directly
2. Position text at bbox coordinates
3. Calculate text rotation from bbox quadrilateral shape
4. Estimate font size from bbox height
5. Skip table HTML parsing entirely for OCR track
## Benefits
- **Reliability**: Raw OCR text positions are accurate
- **Simplicity**: Eliminates complex table parsing logic
- **Performance**: Faster processing without structure analysis
- **Consistency**: Predictable output regardless of table type
## Trade-offs
- No table borders in output
- No cell structure (colspan, rowspan)
- Visual layout approximation rather than semantic structure
## Scope
- OCR track PDF generation only
- Direct track remains unchanged (uses native PDF text extraction)


@@ -0,0 +1,57 @@
# Tasks: Simple Text Positioning
## Phase 1: Core Implementation
- [x] Create `TextRegionRenderer` class in `app/services/text_region_renderer.py`
- [x] Implement `calculate_rotation()` from bbox quadrilateral
- [x] Implement `estimate_font_size()` from bbox height
- [x] Implement `render_text_region()` main method
- [x] Handle coordinate system transformation (OCR → PDF)
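
The OCR → PDF coordinate transformation can be sketched as below. This is an illustration under stated assumptions, not the committed code: the function name `ocr_to_pdf_point` is hypothetical, OCR coordinates are taken to have their origin at the top-left with y pointing down, and the PDF canvas uses ReportLab's default bottom-left origin with y pointing up.

```python
from typing import Tuple


def ocr_to_pdf_point(
    x: float,
    y: float,
    page_height: float,
    scale_factor: float = 1.0,
) -> Tuple[float, float]:
    """Map an OCR image point to PDF canvas coordinates.

    Hypothetical helper. `page_height` is the PDF page height in points;
    `scale_factor` converts OCR pixels to points.
    """
    pdf_x = x * scale_factor
    # Flip the y axis: image rows grow downward, PDF y grows upward
    pdf_y = page_height - y * scale_factor
    return pdf_x, pdf_y


# Top-left corner of the sample bbox on an A4-height page (842 pt)
print(ocr_to_pdf_point(116, 76, page_height=842.0))
```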
## Phase 2: Integration
- [x] Add `simple_text_positioning_enabled` config option
- [x] Modify `PDFGeneratorService._generate_ocr_track_pdf()` to use `TextRegionRenderer`
- [x] Ensure raw OCR regions are loaded correctly via `load_raw_ocr_regions()`
## Phase 3: Image/Chart/Formula Support
- [x] Add image element type detection (`figure`, `image`, `chart`, `seal`, `formula`)
- [x] Render image elements from UnifiedDocument to PDF
- [x] Handle image path resolution (result_dir, imgs/ subdirectory)
- [x] Coordinate transformation for image placement
## Phase 4: Text Straightening & Overlap Avoidance
- [x] Add rotation straightening threshold (default 10°)
  - Small rotation angles (< 10°) are treated as 0° for clean output
  - Only significant rotations (e.g., 90°) are preserved
- [x] Add IoA (Intersection over Area) overlap detection
  - IoA threshold default 0.3 (30% overlap triggers skip)
  - Text regions overlapping with images/charts are skipped
- [x] Collect exclusion zones from image elements
- [x] Pass exclusion zones to text renderer
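
A minimal sketch of the two Phase 4 checks follows. The function names are hypothetical, rectangles are assumed to be axis-aligned `(x0, y0, x1, y1)` tuples, and the IoA denominator is assumed to be the area of the text region (the task list does not say which area is used). Whether the threshold comparison is inclusive is also an assumption.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def straighten(angle: float, threshold: float = 10.0) -> float:
    """Treat rotations smaller than the threshold as 0 degrees."""
    return 0.0 if abs(angle) < threshold else angle


def intersection_over_area(text: Rect, zone: Rect) -> float:
    """Share of the text rectangle that lies inside the exclusion zone."""
    ix = max(0.0, min(text[2], zone[2]) - max(text[0], zone[0]))
    iy = max(0.0, min(text[3], zone[3]) - max(text[1], zone[1]))
    area = (text[2] - text[0]) * (text[3] - text[1])
    return (ix * iy) / area if area > 0 else 0.0


def should_skip(text: Rect, zones: Iterable[Rect], threshold: float = 0.3) -> bool:
    """Skip a text region if it overlaps any zone by at least the threshold."""
    return any(intersection_over_area(text, z) >= threshold for z in zones)
```

Dividing by the text region's own area, rather than by the union as IoU does, means a small label fully inside a large chart scores 1.0 and is reliably skipped.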
## Phase 5: Chart Axis Label Deduplication
- [x] Add `is_axis_label()` method to detect axis labels
  - Y-axis: Vertical text immediately left of chart
  - X-axis: Horizontal text immediately below chart
- [x] Add `is_near_zone()` method for proximity checking
- [x] Position-aware deduplication in `render_text_region()`
  - Collect texts inside zones + axis labels
  - Skip matching text only if near zone or is axis label
  - Preserve matching text far from zones (e.g., table values)
- [x] Test results:
  - "Temperature, C" and "Syringe Thaw Time, Minutes" correctly skipped
  - Table values like "10" at top of page correctly rendered
  - Page 2: 128/148 text regions rendered (20 skipped: 12 overlap + 8 dedupe)
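
The proximity and axis-label checks might look roughly like the sketch below. Only the names `is_near_zone` and `is_axis_label` come from the task list; the signatures, the 50-pixel `margin`, and the use of bbox aspect ratio to tell vertical from horizontal text are all assumptions made for illustration. Rectangles are `(x0, y0, x1, y1)` in image coordinates with y pointing down.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y pointing down


def is_near_zone(text: Rect, zone: Rect, margin: float = 50.0) -> bool:
    """True if the text rectangle touches the zone expanded by `margin`."""
    return (
        text[0] <= zone[2] + margin and text[2] >= zone[0] - margin
        and text[1] <= zone[3] + margin and text[3] >= zone[1] - margin
    )


def is_axis_label(text: Rect, zone: Rect, margin: float = 50.0) -> bool:
    """Detect Y-axis (left of chart) and X-axis (below chart) labels."""
    w, h = text[2] - text[0], text[3] - text[1]
    overlaps_rows = text[1] < zone[3] and text[3] > zone[1]
    overlaps_cols = text[0] < zone[2] and text[2] > zone[0]
    # Y-axis: taller-than-wide text whose right edge sits just left of the chart
    y_axis = h > w and overlaps_rows and 0 <= zone[0] - text[2] <= margin
    # X-axis: wider-than-tall text whose top edge sits just below the chart
    x_axis = w >= h and overlaps_cols and 0 <= text[1] - zone[3] <= margin
    return y_axis or x_axis
```

With these checks, a duplicate string is dropped only when it sits next to the chart, so an identical value elsewhere on the page (such as a table cell) is still rendered.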
## Phase 6: Testing
- [x] Test with scan.pdf task (064e2d67-338c-4e54-b005-204c3b76fe63)
  - Page 2: Chart image rendered, axis labels deduplicated
  - PDF is searchable and selectable
  - Text is properly straightened (no skew artifacts)
- [ ] Compare output quality vs original scan visually
- [ ] Test with documents containing seals/formulas