# Design: Layout Detection Image Preprocessing

## Context

PP-StructureV3's layout detection model (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines or low contrast. This is a preprocessing problem: the model detects tables when their lines are clearly visible, but struggles with poor-quality scans and documents with light-colored borders.

### Current Flow
```
Original Image → PP-Structure (layout detection) → Element Recognition
                          ↓
                Returns element bboxes
                          ↓
         Image extraction crops from original
```

### Proposed Flow
```
Original Image → Preprocess → PP-Structure (layout detection) → Element Recognition
                                        ↓
                              Returns element bboxes
                                        ↓
Original Image ← ← ← ← Image extraction crops from original (NOT preprocessed)
```

## Goals / Non-Goals

### Goals
- Improve table detection for documents with faint lines
- Preserve original image quality for element extraction
- **Hybrid control**: Auto mode by default, manual override available
- **Preview capability**: Users can verify preprocessing before processing
- Minimal performance impact

### Non-Goals
- Preprocessing for text recognition (Raw OCR handles this separately)
- Modifying how PP-Structure internally processes images
- General image quality improvement (out of scope)
- Real-time preview during processing (the preview is available before processing only)

## Decisions

### Decision 1: Preprocess only for layout detection input
**Rationale**:
- Layout detection needs enhanced edges/contrast to identify regions
- Image element extraction needs original quality for output
- Raw OCR text recognition works independently and doesn't need preprocessing

### Decision 2: Use CLAHE (Contrast Limited Adaptive Histogram Equalization) as default
**Rationale**:
- CLAHE's clip limit prevents over-amplification in near-uniform areas, such as bright page backgrounds
- Its adaptive, tile-based nature handles backgrounds that vary across the page
- Well-supported by OpenCV

**Alternatives considered**:
- Global histogram equalization: Too aggressive, causes artifacts
- Manual brightness/contrast: Not adaptive to document variations

### Decision 3: Preprocessing is applied in-memory, not saved to disk
**Rationale**:
- Preprocessed image is only needed during PP-Structure call
- Saving would increase storage and I/O overhead
- Original image is already saved and used for extraction

### Decision 4: Sharpening via Unsharp Mask
**Rationale**:
- Enhances edges with little added noise at moderate strength
- Helps make faint table borders more detectable
- Configurable strength

### Decision 5: Hybrid Control Mode (Auto + Manual)
**Rationale**:
- Auto mode provides seamless experience for most users
- Manual mode gives power users fine control
- Preview allows verification before committing to processing

**Auto-detection algorithm**:
```python
import cv2
import numpy as np


def analyze_image_quality(image: np.ndarray) -> dict:
    """Measure contrast and edge strength of a BGR image and recommend preprocessing."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Contrast: standard deviation of pixel values
    # (cast to a Python float so the result is JSON-serializable)
    contrast = float(np.std(gray))

    # Edge strength: mean of Sobel gradient magnitude
    sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    edge_strength = float(np.mean(np.sqrt(sobel_x**2 + sobel_y**2)))

    return {
        "contrast": contrast,
        "edge_strength": edge_strength,
        "recommended": {
            "contrast": "clahe" if contrast < 40 else "none",
            "sharpen": edge_strength < 15,
            "binarize": contrast < 20
        }
    }
```

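
One way the two modes could be combined into a single effective configuration is sketched below. The function name `resolve_preprocessing_config` and the `ALL_OFF` defaults are hypothetical; the design does not specify how unset manual options behave, so this sketch assumes they stay off.

```python
from typing import Optional

# Hypothetical defaults: every preprocessing step switched off.
ALL_OFF = {"contrast": "none", "sharpen": False, "binarize": False}


def resolve_preprocessing_config(
    mode: str, recommended: dict, manual: Optional[dict] = None
) -> dict:
    """Pick the effective preprocessing config for one page."""
    if mode == "auto":
        # Auto mode: use the recommendation from the quality analysis as-is.
        return dict(recommended)
    if mode == "manual":
        # Manual mode: user-supplied values win; unset steps stay off (assumption).
        return {**ALL_OFF, **(manual or {})}
    raise ValueError(f"unknown preprocessing mode: {mode!r}")


# Usage: the "recommended" dict has the shape returned by the quality analysis.
recommended = {"contrast": "clahe", "sharpen": True, "binarize": False}
auto_config = resolve_preprocessing_config("auto", recommended)
manual_config = resolve_preprocessing_config("manual", recommended, {"sharpen": True})
```
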
### Decision 6: Preview API Design
**Rationale**:
- Users should see preprocessing effect before full processing
- Reduces trial-and-error cycles
- Builds user confidence in the system

**API Design**:
```
POST /api/v2/tasks/{task_id}/preview/preprocessing
Request:
{
  "page": 1,
  "mode": "auto",  // or "manual"
  "config": {      // only for manual mode
    "contrast": "clahe",
    "sharpen": true,
    "binarize": false
  }
}

Response:
{
  "original_url": "/api/v2/tasks/{task_id}/pages/1/image",
  "preprocessed_url": "/api/v2/tasks/{task_id}/pages/1/image?preprocessed=true",
  "quality_metrics": {
    "contrast": 35.2,
    "edge_strength": 12.8
  },
  "auto_config": {
    "contrast": "clahe",
    "sharpen": true,
    "binarize": false
  }
}
```

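
The request body above could be validated on the server roughly as follows. This is a sketch using only the standard library: `PreviewRequest` and `parse_preview_request` are hypothetical names, and it assumes pages are 1-based, that `mode` defaults to `"auto"`, and that the allowed contrast values are the ones that appear elsewhere in this document.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_MODES = ("auto", "manual")
# Contrast values seen in this document; the real set is an assumption.
ALLOWED_CONTRAST = ("none", "clahe", "histogram")


@dataclass(frozen=True)
class PreviewRequest:
    page: int
    mode: str = "auto"
    config: Optional[dict] = None


def parse_preview_request(body: dict) -> PreviewRequest:
    """Validate a decoded JSON body and return a typed request."""
    page = body.get("page")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(page, bool) or not isinstance(page, int) or page < 1:
        raise ValueError("page must be an integer >= 1")

    mode = body.get("mode", "auto")
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}")

    config = body.get("config")
    if mode == "manual":
        if not isinstance(config, dict):
            raise ValueError("manual mode requires a config object")
        if config.get("contrast", "none") not in ALLOWED_CONTRAST:
            raise ValueError(f"contrast must be one of {ALLOWED_CONTRAST}")
    else:
        # The spec marks config as manual-only, so auto mode ignores it.
        config = None

    return PreviewRequest(page=page, mode=mode, config=config)
```
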
## Implementation Details

### Preprocessing Pipeline
```python
from PIL import Image

# Settings is the application's config object; the apply_* helpers are defined elsewhere.


def enhance_for_layout_detection(image: Image.Image, config: Settings) -> Image.Image:
    """Enhance image for better layout detection."""

    # Step 1: Contrast enhancement
    if config.layout_preprocessing_contrast == "clahe":
        image = apply_clahe(image)
    elif config.layout_preprocessing_contrast == "histogram":
        image = apply_histogram_equalization(image)

    # Step 2: Sharpening (optional)
    if config.layout_preprocessing_sharpen:
        image = apply_unsharp_mask(image)

    # Step 3: Binarization (optional, aggressive)
    if config.layout_preprocessing_binarize:
        image = apply_adaptive_threshold(image)

    return image
```

### Integration Point
```python
# In ocr_service.py, before calling PP-Structure
if settings.layout_preprocessing_enabled:
    preprocessed_image = enhance_for_layout_detection(page_image, settings)
    pp_input = preprocessed_image
else:
    pp_input = page_image

# PP-Structure gets preprocessed (or original if disabled)
layout_results = self.structure_engine(pp_input)

# Image extraction still uses original
for element in layout_results:
    if element.type == "image":
        crop_image_from_original(page_image, element.bbox)  # Use original!
```

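
Cropping from the original works because none of the preprocessing steps in this document resize the image, so bounding boxes detected on the preprocessed input map one-to-one onto the original pixels. A minimal sketch of the crop helper, assuming the page is a NumPy array and that `bbox` is `(x1, y1, x2, y2)` in pixel coordinates (the actual bbox format is not stated here):

```python
import numpy as np


def crop_image_from_original(
    original: np.ndarray, bbox: tuple[float, float, float, float]
) -> np.ndarray:
    """Crop bbox = (x1, y1, x2, y2), in pixel coordinates, from the original image."""
    height, width = original.shape[:2]
    x1, y1, x2, y2 = bbox
    # Round outward and clamp so boxes that touch or overshoot the border stay valid.
    left = max(0, int(np.floor(x1)))
    top = max(0, int(np.floor(y1)))
    right = min(width, int(np.ceil(x2)))
    bottom = min(height, int(np.ceil(y2)))
    if right <= left or bottom <= top:
        raise ValueError(f"empty crop for bbox {bbox}")
    # Copy so the crop does not keep the full page array alive.
    return original[top:bottom, left:right].copy()
```
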
## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Performance overhead | Preprocessing is fast (~50 ms/page) and can be enabled or disabled |
| Over-enhancement artifacts | The CLAHE clip limit prevents over-saturation and is configurable |
| Memory spike for large images | Process one page at a time and discard the preprocessed image after use |

## Open Questions

1. Should binarization be applied before or after CLAHE?
   - Current: After (enhances contrast first, then binarize if needed)

2. Should preprocessing parameters be tunable per-request or only server-wide?
   - Current: Server-wide config only (simpler)