proposal: add-layout-preprocessing for improved table detection

Problem: PP-Structure misses tables with faint lines/borders
Solution: Preprocess images (contrast, sharpen) for layout detection
- Preprocessed image only used for layout detection
- Original image preserved for element extraction (quality)

Includes: proposal.md, design.md, tasks.md, spec delta

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-27 14:24:23 +08:00
parent 5448a047ff
commit c12ea0b9f6
4 changed files with 295 additions and 0 deletions


@@ -0,0 +1,124 @@
# Design: Layout Detection Image Preprocessing
## Context
PP-StructureV3's layout detection model (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines or low contrast. This is an input-quality problem that preprocessing can address: the model detects tables when lines are clearly visible, but struggles with poor-quality scans or documents with light-colored borders.
### Current Flow
```
Original Image → PP-Structure (layout detection) → Element Recognition
                        ↓
                Returns element bboxes
                        ↓
                Image extraction crops from original
```
### Proposed Flow
```
Original Image → Preprocess → PP-Structure (layout detection) → Element Recognition
                                      ↓
                              Returns element bboxes
                                      ↓
Original Image ← ← ← ← Image extraction crops from original (NOT preprocessed)
```
## Goals / Non-Goals
### Goals
- Improve table detection for documents with faint lines
- Preserve original image quality for element extraction
- Make preprocessing configurable (enable/disable, intensity)
- Minimal performance impact
### Non-Goals
- Preprocessing for text recognition (Raw OCR handles this separately)
- Modifying how PP-Structure internally processes images
- General image quality improvement (out of scope)
## Decisions
### Decision 1: Preprocess only for layout detection input
**Rationale**:
- Layout detection needs enhanced edges/contrast to identify regions
- Image element extraction needs original quality for output
- Raw OCR text recognition works independently and doesn't need preprocessing
### Decision 2: Use CLAHE (Contrast Limited Adaptive Histogram Equalization) as default
**Rationale**:
- CLAHE's clip limit caps local contrast amplification, which avoids over-amplifying noise in near-uniform regions such as bright page backgrounds
- Adaptive nature handles varying background regions
- Well-supported by OpenCV
**Alternatives considered**:
- Global histogram equalization: Too aggressive, causes artifacts
- Manual brightness/contrast: Not adaptive to document variations
### Decision 3: Preprocessing is applied in-memory, not saved to disk
**Rationale**:
- Preprocessed image is only needed during PP-Structure call
- Saving would increase storage and I/O overhead
- Original image is already saved and used for extraction
### Decision 4: Sharpening via Unsharp Mask
**Rationale**:
- Enhances edges while keeping noise amplification limited at moderate strength
- Helps make faint table borders more detectable
- Configurable strength
## Implementation Details
### Preprocessing Pipeline
```python
from PIL import Image

# Settings is defined in backend/app/core/config.py; the apply_* helpers
# are the enhancement functions provided by the preprocessing service.


def enhance_for_layout_detection(image: Image.Image, config: Settings) -> Image.Image:
    """Enhance image for better layout detection."""
    # Step 1: Contrast enhancement ("none" leaves the image unchanged)
    if config.layout_preprocessing_contrast == "clahe":
        image = apply_clahe(image)
    elif config.layout_preprocessing_contrast == "histogram":
        image = apply_histogram_equalization(image)

    # Step 2: Sharpening (optional)
    if config.layout_preprocessing_sharpen:
        image = apply_unsharp_mask(image)

    # Step 3: Binarization (optional, aggressive)
    if config.layout_preprocessing_binarize:
        image = apply_adaptive_threshold(image)

    return image
```
### Integration Point
```python
# In ocr_service.py, before calling PP-Structure
if settings.layout_preprocessing_enabled:
    preprocessed_image = enhance_for_layout_detection(page_image, settings)
    pp_input = preprocessed_image
else:
    pp_input = page_image

# PP-Structure gets preprocessed (or original if disabled)
layout_results = self.structure_engine(pp_input)

# Image extraction still uses original
for element in layout_results:
    if element.type == "image":
        crop_image_from_original(page_image, element.bbox)  # Use original!
```
## Risks / Trade-offs
| Risk | Mitigation |
|------|------------|
| Performance overhead | Preprocessing is fast (~50ms/page), enable/disable option |
| Over-enhancement artifacts | CLAHE clip limit prevents over-saturation, configurable |
| Memory spike for large images | Process one page at a time, discard preprocessed after use |
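
The per-page overhead figure above is an estimate, so it may be worth measuring on real pages. A small timing helper along these lines could be used for that; `timed_ms` is a hypothetical name, not part of the proposed service.

```python
import time
from typing import Any, Callable, Tuple


def timed_ms(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms
```

For example, `_, ms = timed_ms(enhance_for_layout_detection, page_image, settings)` would give the preprocessing cost for one page, which can then be logged or compared against the estimate.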
## Open Questions
1. Should binarization be applied before or after CLAHE?
- Current: After (enhances contrast first, then binarize if needed)
2. Should preprocessing parameters be tunable per-request or only server-wide?
- Current: Server-wide config only (simpler)


@@ -0,0 +1,62 @@
# Change: Add Image Preprocessing for Layout Detection
## Why
PP-StructureV3's layout detection (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines, low-contrast borders, or poor scan quality. This results in missing table elements in the output, even when the table structure recognition models (SLANeXt) are correctly configured.
The root cause is that layout detection happens **before** table structure recognition: if a region isn't identified as a "table" in the layout detection stage, the table recognition models never get invoked.
## What Changes
- **Add image preprocessing module** for layout detection input
- Contrast enhancement (histogram equalization, CLAHE)
- Optional binarization (adaptive thresholding)
- Sharpening for faint lines
- **Preserve original images for extraction**
- Preprocessing ONLY affects layout detection input
- Image element extraction continues to use original (preserves quality)
- Raw OCR continues to use original image
- **Configurable preprocessing options**
- Enable/disable preprocessing per track
- Adjustable preprocessing intensity
## Impact
### Affected Specs
- `ocr-processing` - New preprocessing configuration requirements
### Affected Code
- `backend/app/services/ocr_service.py` - Add preprocessing before PP-Structure
- `backend/app/core/config.py` - New preprocessing configuration options
- `backend/app/services/preprocessing_service.py` - New service (to be created)
### Track Impact Analysis
| Track | Impact | Reason |
|-------|--------|--------|
| OCR | Improved layout detection | Preprocessing enhances PP-Structure input |
| Hybrid | Potentially improved | Uses PP-Structure for layout |
| Direct | No impact | Does not use PP-Structure |
| Raw OCR | No impact | Continues using original image |
### Quality Impact
| Component | Impact | Reason |
|-----------|--------|--------|
| Table detection | Improved | Enhanced contrast reveals faint borders |
| Image extraction | No change | Uses original image for quality |
| Text recognition | No change | Raw OCR uses original image |
| Reading order | Improved | Better element detection → better ordering |
## Risks
1. **Performance overhead**: Preprocessing adds compute time per page
- Mitigation: Make preprocessing optional and run it once per page (the result is not cached or persisted; see risk 3)
2. **Over-processing**: Strong enhancement may introduce artifacts
- Mitigation: Configurable intensity levels, default to moderate enhancement
3. **Memory usage**: Keeping both original and preprocessed images
- Mitigation: Preprocessed image is temporary, discarded after layout detection


@@ -0,0 +1,55 @@
## ADDED Requirements
### Requirement: Layout Detection Image Preprocessing
The system SHALL provide optional image preprocessing to enhance layout detection accuracy for documents with faint lines, low contrast, or poor scan quality.
#### Scenario: Preprocessing improves table detection
- **GIVEN** a document with faint table borders that PP-Structure fails to detect
- **WHEN** layout preprocessing is enabled
- **THEN** the system SHALL preprocess the image before layout detection
- **AND** contrast enhancement SHALL make faint lines more visible
- **AND** PP-Structure SHALL receive the preprocessed image for layout detection
#### Scenario: Image element extraction uses original quality
- **GIVEN** an image element detected by PP-Structure from preprocessed input
- **WHEN** the system extracts the image element
- **THEN** the system SHALL crop from the ORIGINAL image, not the preprocessed version
- **AND** the extracted image SHALL maintain original quality and colors
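
Because the preprocessing steps in this proposal do not resize the page, a bbox detected on the preprocessed image can be applied to the original unchanged. The sketch below shows one way to turn such a bbox into a safe integer crop box. It assumes the bbox is `(x0, y0, x1, y1)` in pixel coordinates, which is an assumption about the detector output; `bbox_to_crop_box` is a hypothetical helper.

```python
import math
from typing import Sequence, Tuple


def bbox_to_crop_box(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Turn a detected (x0, y0, x1, y1) bbox into an integer crop box clamped to the original image."""
    x0, y0, x1, y1 = bbox
    # Round outward so the crop never cuts into the detected element.
    left = min(max(math.floor(x0), 0), width)
    top = min(max(math.floor(y0), 0), height)
    right = min(max(math.ceil(x1), 0), width)
    bottom = min(max(math.ceil(y1), 0), height)
    if right <= left or bottom <= top:
        raise ValueError(f"bbox {tuple(bbox)} lies outside the {width}x{height} image")
    return left, top, right, bottom
```

The returned tuple has the `(left, upper, right, lower)` shape that Pillow's `Image.crop` accepts, so it can be passed directly to `crop` on the original page image.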
#### Scenario: Preprocessing can be disabled
- **GIVEN** `layout_preprocessing_enabled` is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip preprocessing
- **AND** PP-Structure SHALL receive the original image directly
#### Scenario: CLAHE contrast enhancement
- **WHEN** `layout_preprocessing_contrast` is set to "clahe"
- **THEN** the system SHALL apply Contrast Limited Adaptive Histogram Equalization
- **AND** the enhancement SHALL not over-saturate already bright regions
#### Scenario: Sharpening enhances faint lines
- **WHEN** `layout_preprocessing_sharpen` is enabled
- **THEN** the system SHALL apply unsharp masking to enhance edges
- **AND** faint table borders SHALL become more detectable
#### Scenario: Optional binarization for extreme cases
- **WHEN** `layout_preprocessing_binarize` is enabled
- **THEN** the system SHALL apply adaptive thresholding
- **AND** this SHALL be used only for documents with very poor contrast
### Requirement: Preprocessing Track Isolation
The layout preprocessing feature SHALL only affect layout detection input without impacting other processing components.
#### Scenario: Raw OCR is unaffected
- **GIVEN** layout preprocessing is enabled
- **WHEN** Raw OCR processing runs
- **THEN** Raw OCR SHALL use the original image
- **AND** text detection quality SHALL not be affected by preprocessing
#### Scenario: Preprocessed image is temporary
- **GIVEN** an image is preprocessed for layout detection
- **WHEN** layout detection completes
- **THEN** the preprocessed image SHALL NOT be persisted to storage
- **AND** only the original image and element crops SHALL be saved


@@ -0,0 +1,54 @@
# Tasks: Add Image Preprocessing for Layout Detection
## 1. Configuration
- [ ] 1.1 Add preprocessing configuration to `backend/app/core/config.py`
- `layout_preprocessing_enabled: bool = True` - Enable/disable preprocessing
- `layout_preprocessing_contrast: str = "clahe"` - Options: none, histogram, clahe
- `layout_preprocessing_sharpen: bool = True` - Enable sharpening for faint lines
- `layout_preprocessing_binarize: bool = False` - Optional binarization (aggressive)
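
The four options above could be sketched as follows. This uses a standard-library dataclass purely for illustration; the real `config.py` may use a different settings mechanism, and the class name `LayoutPreprocessingSettings` and the validation step are assumptions.

```python
from dataclasses import dataclass

ALLOWED_CONTRAST_MODES = ("none", "histogram", "clahe")


@dataclass(frozen=True)
class LayoutPreprocessingSettings:
    """Illustrative container for the layout preprocessing options."""
    layout_preprocessing_enabled: bool = True
    layout_preprocessing_contrast: str = "clahe"
    layout_preprocessing_sharpen: bool = True
    layout_preprocessing_binarize: bool = False

    def __post_init__(self) -> None:
        # Reject unknown contrast modes early instead of silently skipping the step.
        if self.layout_preprocessing_contrast not in ALLOWED_CONTRAST_MODES:
            raise ValueError(
                f"layout_preprocessing_contrast must be one of {ALLOWED_CONTRAST_MODES}, "
                f"got {self.layout_preprocessing_contrast!r}"
            )
```

Validating the contrast mode at load time avoids a misspelled value such as `"CLAHE"` quietly disabling contrast enhancement.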
## 2. Preprocessing Service
- [ ] 2.1 Create `backend/app/services/preprocessing_service.py`
- Image loading utility (supports PIL, OpenCV)
- Contrast enhancement methods (histogram equalization, CLAHE)
- Sharpening filter for line enhancement
- Optional adaptive binarization
- Return preprocessed image as numpy array or PIL Image
- [ ] 2.2 Implement `enhance_for_layout_detection()` function
- Input: Original image path or PIL Image
- Output: Preprocessed image (same format as input)
- Steps: contrast → sharpen → (optional) binarize
## 3. Integration with OCR Service
- [ ] 3.1 Update `backend/app/services/ocr_service.py`
- Import preprocessing service
- Before `_run_ppstructure()`, preprocess image if enabled
- Pass preprocessed image to PP-Structure for layout detection
- Keep original image reference for image extraction
- [ ] 3.2 Ensure image element extraction uses original
- Verify `saved_path` and `img_path` in elements reference original
- Bbox coordinates from preprocessed detection applied to original crop
## 4. Testing
- [ ] 4.1 Unit tests for preprocessing_service
- Test contrast enhancement methods
- Test sharpening filter
- Test binarization
- Test with various image formats (PNG, JPEG)
- [ ] 4.2 Integration tests
- Test OCR track with preprocessing enabled/disabled
- Verify image element quality is preserved
- Test with known problematic documents (faint table borders)
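
One way to test the enabled/disabled behavior without loading PP-Structure is to substitute a fake engine that records its input. Everything in this sketch is hypothetical scaffolding (`FakeStructureEngine`, `run_layout_detection`); it only mirrors the input-selection logic described in the design, not the real `ocr_service.py`.

```python
from typing import Any, Callable, List


class FakeStructureEngine:
    """Stand-in for the layout engine: records the image it was called with."""

    def __init__(self) -> None:
        self.received: Any = None

    def __call__(self, image: Any) -> List[dict]:
        self.received = image
        return [{"type": "image", "bbox": (0, 0, 1, 1)}]


def run_layout_detection(page_image: Any, engine: Callable[[Any], List[dict]],
                         enabled: bool, enhance: Callable[[Any], Any]) -> List[dict]:
    """Pass the enhanced image to the engine when enabled, else the original."""
    pp_input = enhance(page_image) if enabled else page_image
    return engine(pp_input)
```

A test can then assert that the engine receives the original object when preprocessing is disabled and the enhanced object when it is enabled, while the original stays available for cropping.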
## 5. Documentation
- [ ] 5.1 Update API documentation
- Document new configuration options
- Explain preprocessing behavior