Design: Layout Detection Image Preprocessing
Context
PP-StructureV3's layout detection model (PP-DocLayout_plus-L) sometimes fails to detect tables with faint lines or low contrast. This is a preprocessing problem: the model detects tables reliably when the lines are clearly visible, but struggles with poor-quality scans or documents with light-colored borders.
Current Flow
```
Original Image → PP-Structure (layout detection) → Element Recognition
                                                            ↓
                                                   Returns element bboxes
                                                            ↓
                                      Image extraction crops from original
```
Proposed Flow
```
Original Image → Preprocess → PP-Structure (layout detection) → Element Recognition
                                                                          ↓
                                                                 Returns element bboxes
                                                                          ↓
Original Image ← ← ← ← ← ← ← ←  Image extraction crops from original (NOT preprocessed)
```
Goals / Non-Goals
Goals
- Improve table detection for documents with faint lines
- Preserve original image quality for element extraction
- Hybrid control: Auto mode by default, manual override available
- Preview capability: Users can verify preprocessing before processing
- Minimal performance impact
Non-Goals
- Preprocessing for text recognition (Raw OCR handles this separately)
- Modifying how PP-Structure internally processes images
- General image quality improvement (out of scope)
- Real-time preview during processing (preview is pre-processing only)
Decisions
Decision 1: Preprocess only for layout detection input
Rationale:
- Layout detection needs enhanced edges/contrast to identify regions
- Image element extraction needs original quality for output
- Raw OCR text recognition works independently and doesn't need preprocessing
Decision 2: Use CLAHE (Contrast Limited Adaptive Histogram Equalization) as default
Rationale:
- CLAHE prevents over-amplification in already bright areas
- Adaptive nature handles varying background regions
- Well-supported by OpenCV
Alternatives considered:
- Global histogram equalization: Too aggressive, causes artifacts
- Manual brightness/contrast: Not adaptive to document variations
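A minimal sketch of what the CLAHE step could look like with OpenCV. The clip limit, tile grid size, the PIL-in/PIL-out signature, and the choice to equalize only the LAB lightness channel are illustrative assumptions, not decided details:

```python
import cv2
import numpy as np
from PIL import Image

def apply_clahe(image: Image.Image, clip_limit: float = 2.0,
                tile_grid_size: tuple = (8, 8)) -> Image.Image:
    """Apply CLAHE to the lightness channel only, preserving color."""
    lab = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    lab = cv2.merge((clahe.apply(l), a, b))
    return Image.fromarray(cv2.cvtColor(lab, cv2.COLOR_LAB2RGB))
```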
Decision 3: Preprocessing is applied in-memory, not saved to disk
Rationale:
- Preprocessed image is only needed during PP-Structure call
- Saving would increase storage and I/O overhead
- Original image is already saved and used for extraction
Decision 4: Sharpening via Unsharp Mask
Rationale:
- Enhances edges without introducing noise
- Helps make faint table borders more detectable
- Configurable strength
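A sketch of the unsharp-mask step under the same assumptions (PIL in/out, OpenCV internally); the strength and sigma defaults are illustrative:

```python
import cv2
import numpy as np
from PIL import Image

def apply_unsharp_mask(image: Image.Image, strength: float = 1.5,
                       sigma: float = 3.0) -> Image.Image:
    """Sharpen edges by subtracting a Gaussian-blurred copy (unsharp mask)."""
    src = np.array(image.convert("RGB"))
    blurred = cv2.GaussianBlur(src, (0, 0), sigmaX=sigma)
    # result = (1 + strength) * src - strength * blurred, saturated to [0, 255]
    sharpened = cv2.addWeighted(src, 1.0 + strength, blurred, -strength, 0)
    return Image.fromarray(sharpened)
```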
Decision 5: Hybrid Control Mode (Auto + Manual)
Rationale:
- Auto mode provides seamless experience for most users
- Manual mode gives power users fine control
- Preview allows verification before committing to processing
Auto-detection algorithm:
```python
import cv2
import numpy as np

def analyze_image_quality(image: np.ndarray) -> dict:
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Contrast: standard deviation of pixel values
    contrast = np.std(gray)
    # Edge strength: mean of Sobel gradient magnitude
    sobel_x = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    sobel_y = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    edge_strength = np.mean(np.sqrt(sobel_x**2 + sobel_y**2))
    return {
        "contrast": contrast,
        "edge_strength": edge_strength,
        "recommended": {
            "contrast": "clahe" if contrast < 40 else "none",
            "sharpen": edge_strength < 15,
            "binarize": contrast < 20,
        },
    }
```
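For illustration, auto mode feeds this analysis straight into the preprocessing config; the file path and the numbers in the comment below are made up:

```python
quality = analyze_image_quality(cv2.imread("page_001.png"))  # hypothetical low-contrast scan
# e.g. {"contrast": 32.5, "edge_strength": 11.2,
#       "recommended": {"contrast": "clahe", "sharpen": True, "binarize": False}}
auto_config = quality["recommended"]
```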
Decision 6: Preview API Design
Rationale:
- Users should see preprocessing effect before full processing
- Reduces trial-and-error cycles
- Builds user confidence in the system
API Design:
```
POST /api/v2/tasks/{task_id}/preview/preprocessing
```

Request:

```json
{
  "page": 1,
  "mode": "auto",        // or "manual"
  "config": {            // only for manual mode
    "contrast": "clahe",
    "sharpen": true,
    "binarize": false
  }
}
```

Response:

```json
{
  "original_url": "/api/v2/tasks/{id}/pages/1/image",
  "preprocessed_url": "/api/v2/tasks/{id}/pages/1/image?preprocessed=true",
  "quality_metrics": {
    "contrast": 35.2,
    "edge_strength": 12.8
  },
  "auto_config": {
    "contrast": "clahe",
    "sharpen": true,
    "binarize": false
  }
}
```
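A rough handler sketch for this endpoint, assuming a FastAPI-style backend; the framework choice and the `load_page_image` and `cache_preprocessed_preview` helpers are assumptions for illustration, not part of the proposal:

```python
import numpy as np
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class PreprocessingPreviewRequest(BaseModel):
    page: int
    mode: str = "auto"          # "auto" or "manual"
    config: dict | None = None  # manual-mode overrides

@router.post("/api/v2/tasks/{task_id}/preview/preprocessing")
def preview_preprocessing(task_id: str, req: PreprocessingPreviewRequest) -> dict:
    page_image = load_page_image(task_id, req.page)          # assumed helper
    metrics = analyze_image_quality(np.array(page_image))
    config = metrics["recommended"] if req.mode == "auto" else req.config
    cache_preprocessed_preview(task_id, req.page, config)    # assumed helper
    base = f"/api/v2/tasks/{task_id}/pages/{req.page}/image"
    return {
        "original_url": base,
        "preprocessed_url": f"{base}?preprocessed=true",
        "quality_metrics": {
            "contrast": float(metrics["contrast"]),
            "edge_strength": float(metrics["edge_strength"]),
        },
        "auto_config": metrics["recommended"],
    }
```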
Implementation Details
Preprocessing Pipeline
```python
from PIL import Image

def enhance_for_layout_detection(image: Image.Image, config: Settings) -> Image.Image:
    """Enhance image for better layout detection."""
    # Step 1: Contrast enhancement
    if config.layout_preprocessing_contrast == "clahe":
        image = apply_clahe(image)
    elif config.layout_preprocessing_contrast == "histogram":
        image = apply_histogram_equalization(image)

    # Step 2: Sharpening (optional)
    if config.layout_preprocessing_sharpen:
        image = apply_unsharp_mask(image)

    # Step 3: Binarization (optional, aggressive)
    if config.layout_preprocessing_binarize:
        image = apply_adaptive_threshold(image)

    return image
```
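For the optional binarization step, a possible `apply_adaptive_threshold` implementation using OpenCV's `adaptiveThreshold`; the block size and constant `c` are illustrative values, not tuned defaults:

```python
import cv2
import numpy as np
from PIL import Image

def apply_adaptive_threshold(image: Image.Image, block_size: int = 31, c: int = 10) -> Image.Image:
    """Binarize with a locally adaptive threshold so faint borders become solid lines."""
    gray = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, block_size, c
    )
    # Keep a 3-channel image so downstream code sees a consistent format
    return Image.fromarray(binary).convert("RGB")
```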
Integration Point
```python
# In ocr_service.py, before calling PP-Structure
if settings.layout_preprocessing_enabled:
    preprocessed_image = enhance_for_layout_detection(page_image, settings)
    pp_input = preprocessed_image
else:
    pp_input = page_image

# PP-Structure gets the preprocessed image (or the original if disabled)
layout_results = self.structure_engine(pp_input)

# Image extraction still uses the original
for element in layout_results:
    if element.type == "image":
        crop_image_from_original(page_image, element.bbox)  # Use original!
```
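The `layout_preprocessing_*` fields referenced above would live in the server-wide Settings object; a sketch assuming pydantic-settings, with defaults as placeholders:

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Knobs read by enhance_for_layout_detection and the integration point above
    layout_preprocessing_enabled: bool = True
    layout_preprocessing_contrast: str = "clahe"   # "clahe" | "histogram" | "none"
    layout_preprocessing_sharpen: bool = True
    layout_preprocessing_binarize: bool = False
```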
Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| Performance overhead | Preprocessing is fast (~50 ms/page) and can be disabled via config |
| Over-enhancement artifacts | CLAHE clip limit prevents over-amplification; the limit is configurable |
| Memory spike for large images | Process one page at a time; discard the preprocessed image after use |
Open Questions
- Should binarization be applied before or after CLAHE?
  - Current: After (enhance contrast first, then binarize if needed)
- Should preprocessing parameters be tunable per-request or only server-wide?
  - Current: Server-wide config only (simpler)