OCR/openspec/specs/ocr-processing/spec.md
egg cfe65158a3 feat: enable document orientation detection for scanned PDFs
- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 17:13:46 +08:00


ocr-processing Specification

Purpose

TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.

Requirements

Requirement: OCR Track Gap Filling with Raw OCR Regions

The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

Scenario: Gap filling activates when coverage is low

  • GIVEN an OCR track processing task
  • WHEN PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
  • THEN the system SHALL activate gap filling
  • AND identify Raw OCR regions not covered by any PP-StructureV3 element
  • AND supplement these regions as TEXT elements in the output
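
The activation rule can be sketched as follows. This is a minimal sketch, not the project's implementation: it assumes coverage is measured as the fraction of Raw OCR regions marked covered (a count-based ratio, which the spec does not state explicitly), and the function name `should_activate_gap_filling` is hypothetical.

```python
from typing import Sequence


def should_activate_gap_filling(
    covered_flags: Sequence[bool],
    coverage_threshold: float = 0.7,
) -> bool:
    """Return True when PP-StructureV3 covers too few Raw OCR regions.

    covered_flags holds one boolean per Raw OCR region (True = covered).
    Assumption: coverage is the share of covered regions by count.
    """
    if not covered_flags:
        # No OCR regions means nothing to supplement (assumed behavior).
        return False
    coverage = sum(covered_flags) / len(covered_flags)
    return coverage < coverage_threshold
```

With three of four regions covered (75%), the default 0.7 threshold leaves gap filling off, while a threshold of 0.8 turns it on.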

Scenario: Coverage is determined by IoA (Intersection over Area)

  • GIVEN a Raw OCR text region with a bounding box
  • WHEN checking if the region is covered by PP-StructureV3
  • THEN the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
  • AND IoA SHALL be used instead of IoU because it correctly measures the "small box contained in large box" relationship
  • AND regions not meeting the IoA criterion SHALL be marked as uncovered
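
To illustrate why IoA suits this check better than IoU, here is a small sketch that computes both for axis-aligned boxes. It assumes boxes are `(x0, y0, x1, y1)` tuples; the helper names `ioa` and `iou` are illustrative.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def _area(box: BBox) -> float:
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Intersection area divided by the OCR box's own area."""
    ocr_area = _area(ocr_box)
    if ocr_area == 0:
        return 0.0
    return _intersection_area(ocr_box, element_box) / ocr_area


def iou(a: BBox, b: BBox) -> float:
    """Intersection area divided by the area of the union."""
    inter = _intersection_area(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0
```

For a 10×10 OCR box fully inside a 100×100 layout element, `ioa` returns 1.0 while `iou` returns 0.01, so only IoA reports the small box as contained.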

Scenario: Element-type-specific IoA thresholds are applied

  • GIVEN a Raw OCR region being evaluated for coverage
  • WHEN comparing against PP-StructureV3 elements of different types
  • THEN the system SHALL apply different IoA thresholds:
    • TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
    • TABLE: IoA > 0.1 (strict filtering to preserve table structure)
    • FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels)
  • AND a region is considered covered if it meets the threshold for ANY overlapping element
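
A possible shape for the type-specific check is sketched below. The threshold values come from this scenario; the `(element_type, bbox)` pair representation and the choice to ignore element types without a listed threshold are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Thresholds from the scenario above, keyed by element type.
IOA_THRESHOLDS: Dict[str, float] = {
    "TEXT": 0.6, "TITLE": 0.6, "HEADER": 0.6, "FOOTER": 0.6,
    "TABLE": 0.1,
    "FIGURE": 0.8, "IMAGE": 0.8,
}


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Intersection area divided by the OCR box area."""
    width = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    height = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    inter = max(0.0, width) * max(0.0, height)
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return inter / area if area > 0 else 0.0


def is_covered(
    ocr_box: BBox,
    elements: Iterable[Tuple[str, BBox]],
    thresholds: Dict[str, float] = IOA_THRESHOLDS,
) -> bool:
    """True if the OCR box exceeds the threshold for ANY element."""
    for element_type, element_box in elements:
        threshold = thresholds.get(element_type)
        if threshold is None:
            continue  # assumption: types without a threshold never cover
        if ioa(ocr_box, element_box) > threshold:
            return True
    return False
```

An OCR box with 20% of its area inside a TABLE counts as covered (0.2 > 0.1), whereas the same overlap with a TEXT element does not (0.2 ≤ 0.6).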

Scenario: Only TEXT elements are supplemented

  • GIVEN uncovered Raw OCR regions identified for supplementation
  • WHEN PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
  • THEN the system SHALL NOT supplement regions that overlap with these structural elements
  • AND only supplement regions as TEXT type to preserve structural integrity

Scenario: Supplemented regions meet confidence threshold

  • GIVEN Raw OCR regions to be supplemented
  • WHEN a region has confidence score below 0.3
  • THEN the system SHALL skip that region
  • AND only supplement regions with confidence >= 0.3

Scenario: Deduplication uses IoA instead of IoU

  • GIVEN a Raw OCR region being considered for supplementation
  • WHEN the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element
  • THEN the system SHALL skip that region to prevent duplicate text
  • AND the original PP-StructureV3 element SHALL be preserved
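
The confidence filter and the deduplication check can be combined in a single pass over the candidate regions. The sketch below assumes regions are dicts with `bbox` and `confidence` keys and that existing TEXT elements are given as a list of boxes; these shapes and the name `select_supplements` are hypothetical.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Intersection area divided by the OCR box area."""
    width = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    height = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    inter = max(0.0, width) * max(0.0, height)
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return inter / area if area > 0 else 0.0


def select_supplements(
    uncovered_regions: List[Dict],
    text_element_boxes: List[BBox],
    confidence_threshold: float = 0.3,
    dedup_ioa_threshold: float = 0.5,
) -> List[Dict]:
    """Keep regions that are confident enough and not duplicates."""
    kept = []
    for region in uncovered_regions:
        if region["confidence"] < confidence_threshold:
            continue  # below the confidence floor
        if any(ioa(region["bbox"], box) > dedup_ioa_threshold
               for box in text_element_boxes):
            continue  # already represented by a PP-StructureV3 TEXT element
        kept.append(region)
    return kept
```

Regions are only skipped, never merged, so the original PP-StructureV3 elements stay untouched.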

Scenario: Reading order is recalculated after gap filling

  • GIVEN supplemented elements have been added to the page
  • WHEN assembling the final element list
  • THEN the system SHALL recalculate reading order for the entire page
  • AND sort elements by y0 coordinate (top to bottom) then x0 (left to right)
  • AND ensure logical document flow is maintained
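
The sort described above amounts to ordering by the tuple `(y0, x0)`. A minimal sketch, assuming each element is a dict with a `bbox` of `(x0, y0, x1, y1)` and that the result is stored in a hypothetical `reading_order` key:

```python
from typing import Dict, List


def recalculate_reading_order(elements: List[Dict]) -> List[Dict]:
    """Return copies of the elements sorted top-to-bottom, then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Copy each element so the caller's dicts are not mutated.
    return [dict(element, reading_order=index)
            for index, element in enumerate(ordered)]
```

A strict `(y0, x0)` sort treats elements on the same text line as separate rows whenever their `y0` values differ by even a pixel; whether any row tolerance is applied is not stated in this spec.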

Scenario: Coordinate alignment with ocr_dimensions

  • GIVEN Raw OCR processing may involve image resizing
  • WHEN comparing Raw OCR bbox with PP-StructureV3 bbox
  • THEN the system SHALL use ocr_dimensions to normalize coordinates
  • AND ensure both sources reference the same coordinate space
  • AND prevent coverage misdetection due to scale differences
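
One way to bring both sources into the same coordinate space is to rescale Raw OCR boxes by the ratio between the two image sizes. The sketch below assumes `ocr_dimensions` is a dict with `width` and `height` describing the image Raw OCR ran on, and that the target size is the image PP-StructureV3 processed; the spec does not define the structure, so both are assumptions.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def normalize_bbox(
    bbox: BBox,
    ocr_dimensions: Dict[str, float],
    target_dimensions: Dict[str, float],
) -> BBox:
    """Scale a Raw OCR bbox into the target (layout) coordinate space."""
    scale_x = target_dimensions["width"] / ocr_dimensions["width"]
    scale_y = target_dimensions["height"] / ocr_dimensions["height"]
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)
```

If Raw OCR ran on a 1000×2000 image and the layout elements refer to a 2000×4000 image, every coordinate doubles before IoA is computed.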

Scenario: Supplemented elements have complete metadata

  • GIVEN a Raw OCR region being added as a supplemented element
  • WHEN creating the DocumentElement
  • THEN the element SHALL include page_number
  • AND include confidence score from Raw OCR
  • AND include original bbox coordinates
  • AND optionally include source indicator for debugging

Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.

Scenario: Gap filling only activates for OCR track

  • GIVEN a document processing task
  • WHEN the processing track is OCR
  • THEN the system SHALL evaluate and apply gap filling as needed
  • AND produce enhanced output with supplemented content

Scenario: Direct track is unaffected

  • GIVEN a document processing task with Direct track
  • WHEN the task is processed
  • THEN the system SHALL NOT invoke any gap filling logic
  • AND produce output identical to current Direct track behavior

Scenario: Hybrid track is unaffected

  • GIVEN a document processing task with Hybrid track
  • WHEN the task is processed
  • THEN the system SHALL NOT invoke gap filling logic
  • AND use existing Hybrid track processing pipeline

Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

Scenario: Gap filling can be disabled via configuration

  • GIVEN gap_filling_enabled is set to false in configuration
  • WHEN OCR track processing runs
  • THEN the system SHALL skip all gap filling logic
  • AND output only PP-StructureV3 results as before

Scenario: Coverage threshold is configurable

  • GIVEN gap_filling_coverage_threshold is set to 0.8
  • WHEN PP-StructureV3 coverage is 75%
  • THEN the system SHALL activate gap filling
  • AND supplement uncovered regions

Scenario: IoA thresholds are configurable per element type

  • GIVEN custom IoA thresholds are configured:
    • gap_filling_ioa_threshold_text: 0.6
    • gap_filling_ioa_threshold_table: 0.1
    • gap_filling_ioa_threshold_figure: 0.8
    • gap_filling_dedup_ioa_threshold: 0.5
  • WHEN evaluating coverage and deduplication
  • THEN the system SHALL use the configured values
  • AND apply them consistently throughout the gap filling process

Scenario: Confidence threshold is configurable

  • GIVEN gap_filling_confidence_threshold is set to 0.5
  • WHEN supplementing Raw OCR regions
  • THEN the system SHALL only include regions with confidence >= 0.5
  • AND filter out lower confidence regions
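
The parameters named in this requirement could be grouped in one settings object. The sketch below uses a plain dataclass; the field names are taken from this spec, while the class name `GapFillingConfig` and the defaults for `gap_filling_enabled` and `gap_filling_shrink_pixels` are assumptions, since the spec does not state them as defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillingConfig:
    """Hypothetical container for the gap filling parameters."""

    gap_filling_enabled: bool = True  # assumed default
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_ioa_threshold_text: float = 0.6
    gap_filling_ioa_threshold_table: float = 0.1
    gap_filling_ioa_threshold_figure: float = 0.8
    gap_filling_dedup_ioa_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3
    gap_filling_shrink_pixels: int = 1  # assumed default
    gap_filling_use_overall_ocr: bool = True


# Overriding a single value leaves the remaining defaults in place.
strict = GapFillingConfig(gap_filling_confidence_threshold=0.5)
```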

Scenario: Boundary shrinking reduces edge duplicates

  • GIVEN gap_filling_shrink_pixels is set to 1
  • WHEN evaluating coverage with IoA
  • THEN the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side
  • AND this reduces false "uncovered" detection at region boundaries
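
Boundary shrinking can be sketched as a small helper applied to each OCR box before IoA is computed. The guard that returns the box unchanged when it is too small to shrink is an assumption; the spec does not say how degenerate boxes are handled.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def shrink_bbox(bbox: BBox, shrink_pixels: int = 1) -> BBox:
    """Move every side of the box inward by shrink_pixels."""
    x0, y0, x1, y1 = bbox
    new_box = (x0 + shrink_pixels, y0 + shrink_pixels,
               x1 - shrink_pixels, y1 - shrink_pixels)
    if new_box[2] <= new_box[0] or new_box[3] <= new_box[1]:
        # Box would collapse; keep the original (assumed behavior).
        return bbox
    return new_box
```

Shrinking trims the one-pixel fringe that often pokes outside a layout element, which raises the IoA of boxes that sit right on the element border.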

Requirement: Layout Model Selection

The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

Scenario: User selects Chinese document model

  • GIVEN a user is processing Chinese business documents (forms, contracts, invoices)
  • WHEN the user selects "Chinese Document Model" (PP-DocLayout-S)
  • THEN the OCR engine SHALL use the PP-DocLayout-S layout detection model
  • AND the model SHALL be optimized for 23 Chinese document element types
  • AND table and form detection accuracy SHALL be improved over the default model

Scenario: User selects standard model for English documents

  • GIVEN a user is processing English academic papers or reports
  • WHEN the user selects "Standard Model" (PubLayNet-based)
  • THEN the OCR engine SHALL use the default PubLayNet-based layout detection model
  • AND the model SHALL be optimized for English document layouts

Scenario: User selects CDLA model for specialized Chinese layout

  • GIVEN a user is processing Chinese documents with complex layouts
  • WHEN the user selects "CDLA Model"
  • THEN the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
  • AND the model SHALL provide specialized Chinese document layout analysis

Scenario: Layout model is sent via API request

  • GIVEN a frontend application with model selection UI
  • WHEN the user starts task processing with a selected model
  • THEN the frontend SHALL send the model choice in the request body:
    POST /api/v2/tasks/{task_id}/start
    {
      "use_dual_track": true,
      "force_track": "ocr",
      "language": "ch",
      "layout_model": "chinese"
    }
    
  • AND the backend SHALL configure PP-StructureV3 with the corresponding model

Scenario: Default model when not specified

  • GIVEN an API request without layout_model parameter
  • WHEN the task is started
  • THEN the system SHALL use "chinese" (PP-DocLayout-S) as the default model
  • AND processing SHALL work correctly without requiring model selection

Scenario: Invalid model name is rejected

  • GIVEN a request with an invalid layout_model value
  • WHEN the user sends layout_model: "invalid_model"
  • THEN the API SHALL return 422 Validation Error
  • AND provide a clear error message listing valid model options
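
The default and the rejection rule can be sketched with a plain enum. Only the value `"chinese"` appears in this spec; the values `"default"` and `"cdla"`, the name `parse_layout_model`, and the mapping table are hypothetical. In a real API layer the `ValueError` would be translated into the 422 response.

```python
from enum import Enum
from typing import Optional


class LayoutModel(str, Enum):
    CHINESE = "chinese"
    DEFAULT = "default"  # hypothetical value for the Standard Model
    CDLA = "cdla"        # hypothetical value for the CDLA Model


# Model names as given in this spec; the standard model's exact
# name is not stated here, so it is left as None.
MODEL_NAMES = {
    LayoutModel.CHINESE: "PP-DocLayout-S",
    LayoutModel.DEFAULT: None,
    LayoutModel.CDLA: "picodet_lcnet_x1_0_fgd_layout_cdla",
}


def parse_layout_model(value: Optional[str]) -> LayoutModel:
    """Resolve the request value, defaulting to the Chinese model."""
    if value is None:
        return LayoutModel.CHINESE
    try:
        return LayoutModel(value)
    except ValueError:
        valid = ", ".join(model.value for model in LayoutModel)
        raise ValueError(
            f"Invalid layout_model {value!r}; valid options: {valid}"
        ) from None
```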

Requirement: Layout Model Selection UI

The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

Scenario: Model options are displayed with descriptions

  • GIVEN the model selection UI is displayed
  • WHEN the user views the available options
  • THEN the UI SHALL show the following options:
    • "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
    • "Standard Model" - for English academic papers, reports
    • "CDLA Model" - for specialized Chinese layout analysis
  • AND each option SHALL have a brief description of its use case

Scenario: Chinese model is selected by default

  • GIVEN the user opens the task processing interface
  • WHEN the model selection is displayed
  • THEN "Chinese Document Model" SHALL be pre-selected as the default
  • AND the user MAY change the selection before starting processing

Scenario: Model selection is visible only for OCR track

  • GIVEN a document processing interface
  • WHEN the user selects a processing track
  • THEN layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
  • AND SHALL be hidden for Direct track (which does not use PP-StructureV3)

Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.

Scenario: User wants to free disk space after model upgrade

  • WHEN the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
  • THEN the documentation SHALL explain how to delete unused cached models from ~/.paddlex/official_models/
  • AND list which model directories can be safely removed

Requirement: Cell Over-Detection Filtering

The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.

Scenario: Cell density exceeds threshold

  • GIVEN a table detected by PP-StructureV3 with cell_boxes
  • WHEN cell density exceeds 3.0 cells per 10,000 px²
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Average cell area below threshold

  • GIVEN a table detected by PP-StructureV3
  • WHEN average cell area is less than 3,000 px²
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Cell height too small

  • GIVEN a table with height H and N cells
  • WHEN (H / N) is less than 10 pixels
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Valid tables are preserved

  • GIVEN a table with normal metrics (density ≤ 3.0, avg area ≥ 3000, height/N ≥ 10)
  • WHEN validation is applied
  • THEN the table SHALL be preserved unchanged
  • AND all cell_boxes SHALL be retained
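
The three heuristics above can be evaluated together. This sketch assumes the table and its `cell_boxes` are `(x0, y0, x1, y1)` tuples in pixels and that density is measured against the table's bounding-box area; the function name `is_over_detected` is hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def is_over_detected(
    table_bbox: BBox,
    cell_boxes: List[BBox],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px, computed as H / N
) -> bool:
    """Flag a table when any one of the three metrics fails."""
    cell_count = len(cell_boxes)
    table_width = table_bbox[2] - table_bbox[0]
    table_height = table_bbox[3] - table_bbox[1]
    table_area = table_width * table_height
    if cell_count == 0 or table_area <= 0:
        return False  # nothing to validate (assumed behavior)

    density = cell_count / table_area * 10_000
    avg_cell_area = sum(
        (c[2] - c[0]) * (c[3] - c[1]) for c in cell_boxes
    ) / cell_count
    height_per_cell = table_height / cell_count

    return (
        density > max_cell_density
        or avg_cell_area < min_avg_cell_area
        or height_per_cell < min_cell_height
    )
```

A 200×100 px table split into twenty 10×100 px cells has a density of 10 cells per 10,000 px² and is flagged, while a 1000×500 px table with four equal cells passes all three checks.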

Requirement: Table-to-Text Reclassification

The system SHALL convert over-detected tables to TEXT elements while preserving content.

Scenario: Table content is preserved

  • GIVEN a table flagged for reclassification
  • WHEN converting to TEXT element
  • THEN the system SHALL extract text content from table HTML
  • AND preserve the original bounding box
  • AND set element type to TEXT
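
Text extraction from the table HTML can be done with the standard library parser. The sketch below assumes the element is a dict with `type`, `bbox`, and `html` keys and writes the result to a `content` key; these key names are hypothetical, as is the choice to join cells with spaces and rows with newlines.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _CellTextExtractor(HTMLParser):
    """Collects the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_html_to_text(html: str) -> str:
    parser = _CellTextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(c.strip() for c in row if c.strip())
             for row in parser.rows]
    return "\n".join(line for line in lines if line)


def reclassify_table_as_text(table: Dict) -> Dict:
    """Copy the element, keeping bbox and switching the type to TEXT."""
    return {**table, "type": "TEXT",
            "content": table_html_to_text(table["html"])}
```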

Scenario: Reading order is recalculated

  • GIVEN tables have been reclassified as TEXT
  • WHEN assembling the final page structure
  • THEN the system SHALL recalculate reading order
  • AND sort elements by y0 then x0 coordinates

Requirement: Validation Configuration

The system SHALL provide configurable thresholds for cell validation.

Scenario: Default thresholds are applied

  • GIVEN no custom configuration is provided
  • WHEN validating tables
  • THEN the system SHALL use default thresholds:
    • max_cell_density: 3.0 cells/10000px²
    • min_avg_cell_area: 3000 px²
    • min_cell_height: 10 px

Scenario: Custom thresholds can be configured

  • GIVEN custom validation thresholds in configuration
  • WHEN validating tables
  • THEN the system SHALL use the custom values
  • AND apply them consistently to all pages

Requirement: Use PP-StructureV3 Internal OCR Results

The system SHALL preferentially use PP-StructureV3's internal OCR results (overall_ocr_res) instead of running a separate Raw OCR inference.

Scenario: Extract overall_ocr_res from PP-StructureV3

  • GIVEN PP-StructureV3 processing completes
  • WHEN the result contains json['res']['overall_ocr_res']
  • THEN the system SHALL extract OCR regions from:
    • dt_polys: detection box polygons
    • rec_texts: recognized text strings
    • rec_scores: confidence scores
  • AND convert these to the standard TextRegion format for gap filling
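
The conversion can be sketched as a zip over the three parallel lists. The keys `res`, `overall_ocr_res`, `dt_polys`, `rec_texts`, and `rec_scores` come from this scenario; the `TextRegion` fields shown here are an assumed minimal shape, and each polygon is assumed to be a sequence of `[x, y]` points that is reduced to an axis-aligned box.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class TextRegion:
    """Assumed minimal shape of the standard TextRegion format."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    confidence: float


def extract_text_regions(result_json: Dict[str, Any]) -> List[TextRegion]:
    """Build TextRegion objects from json['res']['overall_ocr_res']."""
    ocr = result_json.get("res", {}).get("overall_ocr_res")
    if not ocr:
        return []
    regions = []
    for poly, text, score in zip(
        ocr.get("dt_polys", []),
        ocr.get("rec_texts", []),
        ocr.get("rec_scores", []),
    ):
        xs = [float(point[0]) for point in poly]
        ys = [float(point[1]) for point in poly]
        bbox = (min(xs), min(ys), max(xs), max(ys))
        regions.append(TextRegion(text=text, bbox=bbox, confidence=float(score)))
    return regions
```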

Scenario: Skip separate Raw OCR when overall_ocr_res is available

  • GIVEN gap_filling_use_overall_ocr is true (default)
  • WHEN PP-StructureV3 result contains overall_ocr_res
  • THEN the system SHALL NOT execute separate PaddleOCR inference
  • AND use the extracted overall_ocr_res as the OCR source
  • AND this reduces total inference time by approximately 50%

Scenario: Fallback to separate Raw OCR when needed

  • GIVEN gap_filling_use_overall_ocr is false OR overall_ocr_res is missing
  • WHEN gap filling is activated
  • THEN the system SHALL execute separate PaddleOCR inference as before
  • AND use the separate OCR results for gap filling
  • AND this maintains backward compatibility
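
The choice between the two OCR sources reduces to one conditional. In this sketch the separate inference is passed in as a callable so that it only runs on the fallback path; the names `choose_ocr_source` and `run_raw_ocr` are hypothetical.

```python
from typing import Any, Callable, Dict, Tuple


def choose_ocr_source(
    result_json: Dict[str, Any],
    use_overall_ocr: bool,
    run_raw_ocr: Callable[[], Any],
) -> Tuple[Any, str]:
    """Return (ocr_result, source_label) for gap filling.

    run_raw_ocr is only called when overall_ocr_res cannot be used,
    so the extra inference is skipped on the preferred path.
    """
    if use_overall_ocr:
        overall = result_json.get("res", {}).get("overall_ocr_res")
        if overall:
            return overall, "overall_ocr_res"
    return run_raw_ocr(), "raw_ocr"
```

Returning a source label alongside the result is one way to support the optional source indicator mentioned for debugging.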

Scenario: Coordinate consistency is guaranteed

  • GIVEN overall_ocr_res is extracted from PP-StructureV3
  • WHEN comparing with PP-StructureV3 layout elements
  • THEN both SHALL use the same coordinate system
  • AND no additional coordinate alignment is needed
  • AND this prevents scale mismatch issues