OCR/openspec/changes/add-layout-preprocessing/design.md
egg c12ea0b9f6 proposal: add-layout-preprocessing for improved table detection
Problem: PP-Structure misses tables with faint lines/borders
Solution: Preprocess images (contrast, sharpen) for layout detection
- Preprocessed image only used for layout detection
- Original image preserved for element extraction (quality)

Includes: proposal.md, design.md, tasks.md, spec delta

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 14:24:23 +08:00


Design: Layout Detection Image Preprocessing

Context

PP-StructureV3's layout detection model (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines or low contrast. This is a preprocessing problem: the model detects tables when their lines are clearly visible, but struggles with poor-quality scans and documents with light-colored borders.

Current Flow

Original Image → PP-Structure (layout detection) → Element Recognition
                      ↓
              Returns element bboxes
                      ↓
              Image extraction crops from original

Proposed Flow

Original Image → Preprocess → PP-Structure (layout detection) → Element Recognition
                                      ↓
                              Returns element bboxes
                                      ↓
Original Image ← ← ← ← Image extraction crops from original (NOT preprocessed)

Goals / Non-Goals

Goals

  • Improve table detection for documents with faint lines
  • Preserve original image quality for element extraction
  • Make preprocessing configurable (enable/disable, intensity)
  • Keep performance impact minimal

Non-Goals

  • Preprocessing for text recognition (Raw OCR handles this separately)
  • Modifying how PP-Structure internally processes images
  • General image quality improvement (out of scope)

Decisions

Decision 1: Preprocess only for layout detection input

Rationale:

  • Layout detection needs enhanced edges/contrast to identify regions
  • Image element extraction needs original quality for output
  • Raw OCR text recognition works independently and doesn't need preprocessing

Decision 2: Use CLAHE (Contrast Limited Adaptive Histogram Equalization) as default

Rationale:

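  • CLAHE's clip limit caps local contrast amplification, so noise in near-uniform regions (such as blank page background) is not over-amplified
  • Tile-based adaptation handles backgrounds that vary across the page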
  • Well-supported by OpenCV

Alternatives considered:

  • Global histogram equalization: Too aggressive, causes artifacts
  • Manual brightness/contrast: Not adaptive to document variations

Decision 3: Preprocessing is applied in-memory, not saved to disk

Rationale:

  • Preprocessed image is only needed during PP-Structure call
  • Saving would increase storage and I/O overhead
  • Original image is already saved and used for extraction

Decision 4: Sharpening via Unsharp Mask

Rationale:

  • Enhances edges without adding new noise (existing noise is amplified along with edges, so strength should stay moderate)
  • Helps make faint table borders more detectable
  • Configurable strength
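A minimal sketch of unsharp masking on a grayscale array, assuming NumPy only. The 3x3 box blur here is a stand-in for the Gaussian blur a real implementation would more likely use (for example through OpenCV or Pillow's `ImageFilter.UnsharpMask`). The names `box_blur3`, `unsharp_mask`, and `strength` are hypothetical.

```python
import numpy as np


def box_blur3(gray: np.ndarray) -> np.ndarray:
    """3x3 mean filter with replicated edges (stand-in for a Gaussian blur)."""
    height, width = gray.shape
    padded = np.pad(gray.astype(np.float64), 1, mode="edge")
    total = np.zeros((height, width), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + height, dx:dx + width]
    return total / 9.0


def unsharp_mask(gray: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Return gray + strength * (gray - blurred), clipped back to uint8."""
    original = gray.astype(np.float64)
    detail = original - box_blur3(gray)  # high-frequency component (edges)
    sharpened = original + strength * detail
    return np.clip(np.round(sharpened), 0, 255).astype(np.uint8)
```

With `strength=0.0` the image is returned unchanged. Larger values darken a faint line and slightly brighten its immediate neighbors, which widens the gap between the border and the page background.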

Implementation Details

Preprocessing Pipeline

from PIL import Image


def enhance_for_layout_detection(image: Image.Image, config: Settings) -> Image.Image:
    """Enhance a page image for layout detection only; the caller keeps the original for cropping."""

    # Step 1: Contrast enhancement
    if config.layout_preprocessing_contrast == "clahe":
        image = apply_clahe(image)
    elif config.layout_preprocessing_contrast == "histogram":
        image = apply_histogram_equalization(image)

    # Step 2: Sharpening (optional)
    if config.layout_preprocessing_sharpen:
        image = apply_unsharp_mask(image)

    # Step 3: Binarization (optional, aggressive)
    if config.layout_preprocessing_binarize:
        image = apply_adaptive_threshold(image)

    return image
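The pipeline reads several fields from its config object. One possible shape for those fields is sketched below. The field names come from the code in this document; the class name, the default values, and the extra `"none"` option are assumptions, not the project's actual `Settings` class.

```python
from dataclasses import dataclass

# "clahe" and "histogram" appear in the pipeline; "none" is an assumed opt-out.
ALLOWED_CONTRAST = ("clahe", "histogram", "none")


@dataclass(frozen=True)
class LayoutPreprocessingSettings:
    """Hypothetical server-wide options for layout preprocessing."""

    layout_preprocessing_enabled: bool = False
    layout_preprocessing_contrast: str = "clahe"  # default per Decision 2
    layout_preprocessing_sharpen: bool = True
    layout_preprocessing_binarize: bool = False   # optional and aggressive

    def __post_init__(self) -> None:
        # Reject unknown modes early instead of silently skipping enhancement.
        if self.layout_preprocessing_contrast not in ALLOWED_CONTRAST:
            raise ValueError(
                f"layout_preprocessing_contrast must be one of {ALLOWED_CONTRAST}, "
                f"got {self.layout_preprocessing_contrast!r}"
            )
```

Validating the contrast mode at construction time means a typo in configuration fails at startup rather than producing an unenhanced image with no warning.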

Integration Point

# In ocr_service.py, before calling PP-Structure
if settings.layout_preprocessing_enabled:
    preprocessed_image = enhance_for_layout_detection(page_image, settings)
    pp_input = preprocessed_image
else:
    pp_input = page_image

# PP-Structure gets preprocessed (or original if disabled)
layout_results = self.structure_engine(pp_input)

# Image extraction still uses original
for element in layout_results:
    if element.type == "image":
        crop_image_from_original(page_image, element.bbox)  # Use original!
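Cropping from the original only works if preprocessing leaves the page dimensions unchanged, since the bounding boxes are expressed in the preprocessed image's pixel coordinates. The sketch below makes that assumption explicit. The function name, the NumPy array representation, and the `(x0, y0, x1, y1)` bbox layout are illustrative assumptions.

```python
import math
from typing import Tuple

import numpy as np

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in pixels


def crop_from_original(original: np.ndarray, preprocessed: np.ndarray, bbox: BBox) -> np.ndarray:
    """Crop bbox (detected on the preprocessed page) out of the original page."""
    if original.shape[:2] != preprocessed.shape[:2]:
        raise ValueError("preprocessing changed the page size; bboxes no longer map to the original")
    height, width = original.shape[:2]
    x0, y0, x1, y1 = bbox
    # Expand to whole pixels, then clamp to the page bounds.
    left = max(0, math.floor(x0))
    top = max(0, math.floor(y0))
    right = min(width, math.ceil(x1))
    bottom = min(height, math.ceil(y1))
    return original[top:bottom, left:right].copy()
```

If a future preprocessing step resizes the page, the bboxes would need to be rescaled before cropping; the guard above surfaces that case instead of returning a misaligned crop.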

Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Performance overhead | Preprocessing is fast (~50 ms/page), enable/disable option |
| Over-enhancement artifacts | CLAHE clip limit prevents over-saturation, configurable |
| Memory spike for large images | Process one page at a time, discard preprocessed after use |
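The per-page overhead figure is an estimate, so it is worth measuring on representative scans. A minimal timing helper, using only the standard library, might look like the following; `mean_ms_per_page` and its arguments are hypothetical names.

```python
import time
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def mean_ms_per_page(preprocess: Callable[[T], object], pages: Iterable[T]) -> float:
    """Run preprocess on each page and return the mean wall-clock time in milliseconds."""
    durations = []
    for page in pages:
        start = time.perf_counter()
        preprocess(page)
        durations.append((time.perf_counter() - start) * 1000.0)
    if not durations:
        raise ValueError("no pages to time")
    return sum(durations) / len(durations)
```

Passing the real preprocessing function and a handful of typical page images gives a number that can be compared against the estimate in the table.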

Open Questions

  1. Should binarization be applied before or after CLAHE?

    • Current: After (enhances contrast first, then binarizes if needed)

  2. Should preprocessing parameters be tunable per-request or only server-wide?

    • Current: Server-wide config only (simpler)
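If the second question were ever resolved in favor of per-request tuning, one possible shape is to layer request overrides on top of the server-wide defaults. This is only a sketch of that option. The class, the `resolve_settings` helper, and the override mapping are all hypothetical, and the current design remains server-wide only.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class LayoutPreprocessingSettings:
    """Hypothetical server-wide defaults (field names follow this document's code)."""

    layout_preprocessing_enabled: bool = False
    layout_preprocessing_contrast: str = "clahe"
    layout_preprocessing_sharpen: bool = True
    layout_preprocessing_binarize: bool = False


def resolve_settings(
    server: LayoutPreprocessingSettings,
    request_overrides: Mapping[str, Any],
) -> LayoutPreprocessingSettings:
    """Return a copy of the server settings with per-request overrides applied."""
    allowed = {f.name for f in fields(server)}
    unknown = set(request_overrides) - allowed
    if unknown:
        raise ValueError(f"unknown preprocessing options: {sorted(unknown)}")
    # replace() builds a new frozen instance; the server defaults stay untouched.
    return replace(server, **request_overrides)
```

Because the settings object is frozen, a request can never mutate the shared server configuration, which keeps the server-wide behavior intact for every other request.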