egg/OCR

egg cfe65158a3 feat: enable document orientation detection for scanned PDFs

- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 17:13:46 +08:00

15 KiB

ocr-processing Specification

Purpose

TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.

Requirements

Requirement: OCR Track Gap Filling with Raw OCR Regions

The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

Scenario: Gap filling activates when coverage is low

GIVEN an OCR track processing task
WHEN PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
THEN the system SHALL activate gap filling
AND identify Raw OCR regions not covered by any PP-StructureV3 element
AND supplement these regions as TEXT elements in the output
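
Non-normative sketch of the activation check. It assumes coverage is measured as the share of Raw OCR regions already marked covered; the specification does not say whether coverage is counted per region or by area. The function name and the empty-input behavior are hypothetical.

```python
def should_activate_gap_filling(covered_flags, coverage_threshold=0.7):
    """Return True when the share of covered Raw OCR regions is below the threshold.

    covered_flags: one boolean per Raw OCR region (True = covered by PP-StructureV3).
    """
    if not covered_flags:
        # Assumption: with no Raw OCR regions there is nothing to fill.
        return False
    coverage = sum(covered_flags) / len(covered_flags)
    return coverage < coverage_threshold


# 1 of 3 regions covered: coverage is about 33%, so gap filling activates.
print(should_activate_gap_filling([True, False, False]))
# 8 of 10 regions covered: coverage is 80%, so gap filling stays off.
print(should_activate_gap_filling([True] * 8 + [False] * 2))
```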

Scenario: Coverage is determined by IoA (Intersection over Area)

GIVEN a Raw OCR text region with bounding box
WHEN checking if the region is covered by PP-StructureV3
THEN the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
AND IoA SHALL be used instead of IoU because it correctly measures the "small box contained in large box" relationship, whereas IoU stays low whenever the two boxes differ greatly in size
AND regions not meeting the IoA criterion SHALL be marked as uncovered
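
Non-normative sketch comparing IoA with IoU for a small OCR box that lies fully inside a large layout element. It assumes axis-aligned boxes in `(x0, y0, x1, y1)` form; the function names are illustrative and are not taken from the codebase.

```python
def intersection_area(a, b):
    """Area of the overlap between two (x0, y0, x1, y1) boxes, 0.0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def ioa(ocr_box, element_box):
    """Intersection area divided by the OCR box's own area."""
    area = box_area(ocr_box)
    return intersection_area(ocr_box, element_box) / area if area > 0 else 0.0


def iou(a, b):
    """Intersection area divided by the area of the union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


small_ocr_box = (10, 10, 20, 20)      # 100 px^2 text line
large_element = (0, 0, 100, 100)      # 10,000 px^2 layout block
print(ioa(small_ocr_box, large_element))  # fully contained
print(iou(small_ocr_box, large_element))  # tiny, despite full containment
```

With these boxes IoA is 1.0 while IoU is only 0.01, which is why an IoU-based test would wrongly report the contained region as uncovered.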

Scenario: Element-type-specific IoA thresholds are applied

GIVEN a Raw OCR region being evaluated for coverage
WHEN comparing against PP-StructureV3 elements of different types
THEN the system SHALL apply different IoA thresholds:
- TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
- TABLE: IoA > 0.1 (strict filtering: even slight overlap with a table counts as covered, which preserves table structure)
- FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels)
AND a region is considered covered if it meets the threshold for ANY overlapping element
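
Non-normative sketch of the per-type coverage test. The lowercase type keys, the `(element_type, bbox)` pair layout, and the rule that types without a threshold never cover a region are assumptions made for illustration.

```python
# Thresholds copied from the scenario above; the key spelling is an assumption.
IOA_THRESHOLDS = {
    "text": 0.6, "title": 0.6, "header": 0.6, "footer": 0.6,
    "table": 0.1,
    "figure": 0.8, "image": 0.8,
}


def ioa(ocr_box, element_box):
    """Intersection area divided by the OCR box area; boxes are (x0, y0, x1, y1)."""
    width = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    height = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    inter = max(0.0, width) * max(0.0, height)
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return inter / area if area > 0 else 0.0


def is_covered(ocr_box, elements, thresholds=IOA_THRESHOLDS):
    """Return True if ANY element exceeds the IoA threshold for its type.

    elements: iterable of (element_type, bbox) pairs.
    """
    for element_type, element_box in elements:
        threshold = thresholds.get(element_type)
        if threshold is None:
            continue  # Assumption: types without a threshold never cover a region.
        if ioa(ocr_box, element_box) > threshold:
            return True
    return False


ocr_box = (0, 0, 10, 10)
overlapping = (8, 0, 20, 10)  # overlaps 20% of the OCR box
print(is_covered(ocr_box, [("table", overlapping)]))  # 0.2 > 0.1
print(is_covered(ocr_box, [("text", overlapping)]))   # 0.2 is not > 0.6
```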

Scenario: Only TEXT elements are supplemented

GIVEN uncovered Raw OCR regions identified for supplementation
WHEN PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
THEN the system SHALL NOT supplement regions that overlap with these structural elements
AND only supplement regions as TEXT type to preserve structural integrity

Scenario: Supplemented regions meet confidence threshold

GIVEN Raw OCR regions to be supplemented
WHEN a region has confidence score below 0.3
THEN the system SHALL skip that region
AND only supplement regions with confidence >= 0.3

Scenario: Deduplication uses IoA instead of IoU

GIVEN a Raw OCR region being considered for supplementation
WHEN the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element
THEN the system SHALL skip that region to prevent duplicate text
AND the original PP-StructureV3 element SHALL be preserved

Scenario: Reading order is recalculated after gap filling

GIVEN supplemented elements have been added to the page
WHEN assembling the final element list
THEN the system SHALL recalculate reading order for the entire page
AND sort elements by y0 coordinate (top to bottom) then x0 (left to right)
AND ensure logical document flow is maintained
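
Non-normative sketch of the reading-order recalculation as a plain sort on `(y0, x0)`. The dictionary keys `bbox` and `reading_order` are hypothetical field names chosen for the example.

```python
def recalculate_reading_order(elements):
    """Return copies of the elements sorted top-to-bottom, then left-to-right.

    Each element is a dict with a "bbox" of (x0, y0, x1, y1); the result
    carries a fresh 0-based "reading_order" index.
    """
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    return [{**element, "reading_order": index} for index, element in enumerate(ordered)]


page = [
    {"id": "right-of-title", "bbox": (50, 10, 90, 20)},
    {"id": "body", "bbox": (0, 100, 40, 120)},
    {"id": "title", "bbox": (0, 10, 40, 20)},
]
for element in recalculate_reading_order(page):
    print(element["reading_order"], element["id"])
```

A strict sort on y0 treats elements whose tops differ by even one pixel as separate rows, so an implementation may want to group rows within a tolerance; the specification does not address this.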

Scenario: Coordinate alignment with ocr_dimensions

GIVEN Raw OCR processing may involve image resizing
WHEN comparing Raw OCR bbox with PP-StructureV3 bbox
THEN the system SHALL use ocr_dimensions to normalize coordinates
AND ensure both sources reference the same coordinate space
AND prevent coverage misdetection due to scale differences
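
Non-normative sketch of the coordinate alignment. The specification does not define the shape of `ocr_dimensions`; this example assumes a dict holding the `width` and `height` of the image that Raw OCR ran on, and a second dict with the dimensions of the coordinate space used by PP-StructureV3.

```python
def normalize_bbox(bbox, ocr_dimensions, target_dimensions):
    """Scale an (x0, y0, x1, y1) box from Raw OCR space into the target space."""
    scale_x = target_dimensions["width"] / ocr_dimensions["width"]
    scale_y = target_dimensions["height"] / ocr_dimensions["height"]
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# Raw OCR ran on a half-size image, so every coordinate doubles.
print(normalize_bbox(
    (100, 50, 200, 100),
    ocr_dimensions={"width": 1000, "height": 500},
    target_dimensions={"width": 2000, "height": 1000},
))
```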

Scenario: Supplemented elements have complete metadata

GIVEN a Raw OCR region being added as supplemented element
WHEN creating the DocumentElement
THEN the element SHALL include page_number
AND include confidence score from Raw OCR
AND include original bbox coordinates
AND optionally include source indicator for debugging

Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.

Scenario: Gap filling only activates for OCR track

GIVEN a document processing task
WHEN the processing track is OCR
THEN the system SHALL evaluate and apply gap filling as needed
AND produce enhanced output with supplemented content

Scenario: Direct track is unaffected

GIVEN a document processing task with Direct track
WHEN the task is processed
THEN the system SHALL NOT invoke any gap filling logic
AND produce output identical to current Direct track behavior

Scenario: Hybrid track is unaffected

GIVEN a document processing task with Hybrid track
WHEN the task is processed
THEN the system SHALL NOT invoke gap filling logic
AND use existing Hybrid track processing pipeline

Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

Scenario: Gap filling can be disabled via configuration

GIVEN gap_filling_enabled is set to false in configuration
WHEN OCR track processing runs
THEN the system SHALL skip all gap filling logic
AND output only PP-StructureV3 results as before

Scenario: Coverage threshold is configurable

GIVEN gap_filling_coverage_threshold is set to 0.8
WHEN PP-StructureV3 coverage is 75% (below the configured 80% threshold)
THEN the system SHALL activate gap filling
AND supplement uncovered regions

Scenario: IoA thresholds are configurable per element type

GIVEN custom IoA thresholds configured:
- gap_filling_ioa_threshold_text: 0.6
- gap_filling_ioa_threshold_table: 0.1
- gap_filling_ioa_threshold_figure: 0.8
- gap_filling_dedup_ioa_threshold: 0.5
WHEN evaluating coverage and deduplication
THEN the system SHALL use the configured values
AND apply them consistently throughout gap filling process

Scenario: Confidence threshold is configurable

GIVEN gap_filling_confidence_threshold is set to 0.5
WHEN supplementing Raw OCR regions
THEN the system SHALL only include regions with confidence >= 0.5
AND filter out lower confidence regions

Scenario: Boundary shrinking reduces edge duplicates

GIVEN gap_filling_shrink_pixels is set to 1
WHEN evaluating coverage with IoA
THEN the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side
AND this reduces false "uncovered" detection at region boundaries
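
Non-normative sketch that gathers the parameters named in this requirement into one configuration object and applies the boundary shrink. The class name, the defaults for `gap_filling_enabled` and `gap_filling_shrink_pixels`, and the handling of boxes too small to shrink are assumptions; the remaining defaults repeat values stated elsewhere in this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillingConfig:
    gap_filling_enabled: bool = True              # assumed default
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_ioa_threshold_text: float = 0.6
    gap_filling_ioa_threshold_table: float = 0.1
    gap_filling_ioa_threshold_figure: float = 0.8
    gap_filling_dedup_ioa_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3
    gap_filling_shrink_pixels: int = 1            # assumed default


def shrink_box(bbox, config):
    """Shrink an (x0, y0, x1, y1) box inward on each side before the IoA check."""
    pixels = config.gap_filling_shrink_pixels
    x0, y0, x1, y1 = bbox
    shrunk = (x0 + pixels, y0 + pixels, x1 - pixels, y1 - pixels)
    if shrunk[0] >= shrunk[2] or shrunk[1] >= shrunk[3]:
        # Assumption: leave boxes that would collapse unchanged.
        return bbox
    return shrunk


print(shrink_box((10, 10, 20, 20), GapFillingConfig()))
print(shrink_box((10, 10, 20, 20), GapFillingConfig(gap_filling_shrink_pixels=0)))
```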

Requirement: Layout Model Selection

The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

Scenario: User selects Chinese document model

GIVEN a user is processing Chinese business documents (forms, contracts, invoices)
WHEN the user selects "Chinese Document Model" (PP-DocLayout-S)
THEN the OCR engine SHALL use the PP-DocLayout-S layout detection model
AND the model SHALL be optimized for 23 Chinese document element types
AND table and form detection accuracy SHALL be improved over the default model

Scenario: User selects standard model for English documents

GIVEN a user is processing English academic papers or reports
WHEN the user selects "Standard Model" (PubLayNet-based)
THEN the OCR engine SHALL use the default PubLayNet-based layout detection model
AND the model SHALL be optimized for English document layouts

Scenario: User selects CDLA model for specialized Chinese layout

GIVEN a user is processing Chinese documents with complex layouts
WHEN the user selects "CDLA Model"
THEN the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
AND the model SHALL provide specialized Chinese document layout analysis

Scenario: Layout model is sent via API request

GIVEN a frontend application with model selection UI
WHEN the user starts task processing with a selected model

THEN the frontend SHALL send the model choice in the request body:

POST /api/v2/tasks/{task_id}/start
{
  "use_dual_track": true,
  "force_track": "ocr",
  "language": "ch",
  "layout_model": "chinese"
}

AND the backend SHALL configure PP-StructureV3 with the corresponding model

Scenario: Default model when not specified

GIVEN an API request without layout_model parameter
WHEN the task is started
THEN the system SHALL use "chinese" (PP-DocLayout-S) as the default model
AND processing SHALL work correctly without requiring model selection

Scenario: Invalid model name is rejected

GIVEN a request with an invalid layout_model value
WHEN the user sends layout_model: "invalid_model"
THEN the API SHALL return 422 Validation Error
AND provide a clear error message listing valid model options
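
Non-normative, framework-free sketch of the default and validation rules for `layout_model`. Only the value "chinese" appears in this specification; the keys "default" and "cdla", the exception class, and the function name are hypothetical. A real backend would more likely rely on its web framework's request validation to produce the 422 response.

```python
# Maps API values to model names. Only "chinese" is confirmed by the spec;
# "default" and "cdla" are hypothetical keys for the other two options.
LAYOUT_MODELS = {
    "chinese": "PP-DocLayout-S",
    "default": None,  # PubLayNet-based default; the spec gives no model name
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}


class LayoutModelValidationError(ValueError):
    """Raised for unknown layout_model values; maps to HTTP 422."""
    status_code = 422


def resolve_layout_model(request_body):
    """Return (choice, model_name), falling back to "chinese" when unspecified."""
    choice = request_body.get("layout_model", "chinese")
    if choice not in LAYOUT_MODELS:
        valid = ", ".join(sorted(LAYOUT_MODELS))
        raise LayoutModelValidationError(
            f"Invalid layout_model {choice!r}; valid options: {valid}"
        )
    return choice, LAYOUT_MODELS[choice]


print(resolve_layout_model({"force_track": "ocr"}))
try:
    resolve_layout_model({"layout_model": "invalid_model"})
except LayoutModelValidationError as error:
    print(error.status_code, error)
```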

Requirement: Layout Model Selection UI

The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

Scenario: Model options are displayed with descriptions

GIVEN the model selection UI is displayed
WHEN the user views the available options
THEN the UI SHALL show the following options:
- "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
- "Standard Model" - for English academic papers, reports
- "CDLA Model" - for specialized Chinese layout analysis
AND each option SHALL have a brief description of its use case

Scenario: Chinese model is selected by default

GIVEN the user opens the task processing interface
WHEN the model selection is displayed
THEN "Chinese Document Model" SHALL be pre-selected as the default
AND the user MAY change the selection before starting processing

Scenario: Model selection is visible only for OCR track

GIVEN a document processing interface
WHEN the user selects processing track
THEN layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
AND SHALL be hidden for Direct track (which does not use PP-StructureV3)

Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.

Scenario: User wants to free disk space after model upgrade

WHEN the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
THEN the documentation SHALL explain how to delete unused cached models from ~/.paddlex/official_models/
AND list which model directories can be safely removed

Requirement: Cell Over-Detection Filtering

The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.

Scenario: Cell density exceeds threshold

GIVEN a table detected by PP-StructureV3 with cell_boxes
WHEN cell density exceeds 3.0 cells per 10,000 px²
THEN the system SHALL flag the table as over-detected
AND reclassify the table as a TEXT element

Scenario: Average cell area below threshold

GIVEN a table detected by PP-StructureV3
WHEN average cell area is less than 3,000 px²
THEN the system SHALL flag the table as over-detected
AND reclassify the table as a TEXT element

Scenario: Cell height too small

GIVEN a table with height H and N cells
WHEN (H / N) is less than 10 pixels
THEN the system SHALL flag the table as over-detected
AND reclassify the table as a TEXT element

Scenario: Valid tables are preserved

GIVEN a table with normal metrics (density ≤ 3.0, avg area ≥ 3000, height/N ≥ 10)
WHEN validation is applied
THEN the table SHALL be preserved unchanged
AND all cell_boxes SHALL be retained
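
Non-normative sketch of the three over-detection heuristics. It assumes the table and its cells are `(x0, y0, x1, y1)` boxes and that a table without cell boxes is never flagged; the function name is hypothetical, while the parameter names follow the threshold names used in this specification.

```python
def is_over_detected(table_bbox, cell_boxes,
                     max_cell_density=3.0,
                     min_avg_cell_area=3000.0,
                     min_cell_height=10.0):
    """Return True if any of the three heuristics flags the table."""
    cell_count = len(cell_boxes)
    x0, y0, x1, y1 = table_bbox
    table_area = (x1 - x0) * (y1 - y0)
    if cell_count == 0 or table_area <= 0:
        return False  # Assumption: nothing to validate.

    density = cell_count / table_area * 10_000          # cells per 10,000 px^2
    avg_cell_area = sum((c[2] - c[0]) * (c[3] - c[1]) for c in cell_boxes) / cell_count
    height_per_cell = (y1 - y0) / cell_count            # H / N from the scenario

    return (density > max_cell_density
            or avg_cell_area < min_avg_cell_area
            or height_per_cell < min_cell_height)


# Four large cells in a 1000 x 500 table: all metrics are normal.
normal_cells = [(0, 0, 500, 250), (500, 0, 1000, 250),
                (0, 250, 500, 500), (500, 250, 1000, 500)]
print(is_over_detected((0, 0, 1000, 500), normal_cells))

# Twenty 20 x 10 cells in a 200 x 100 table: density is 10 per 10,000 px^2.
tiny_cells = [(0, 0, 20, 10)] * 20
print(is_over_detected((0, 0, 200, 100), tiny_cells))
```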

Requirement: Table-to-Text Reclassification

The system SHALL convert over-detected tables to TEXT elements while preserving content.

Scenario: Table content is preserved

GIVEN a table flagged for reclassification
WHEN converting to TEXT element
THEN the system SHALL extract text content from table HTML
AND preserve the original bounding box
AND set element type to TEXT
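
Non-normative sketch of the reclassification step using only the Python standard library. The dictionary keys `html`, `bbox`, `type`, and `text`, and the choice to join cells with spaces and rows with newlines, are assumptions for illustration.

```python
from html.parser import HTMLParser


class _CellTextExtractor(HTMLParser):
    """Collects the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:          # tolerate cells outside any <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def table_to_text_element(table):
    """Build a TEXT element that keeps the table's bbox and its cell text."""
    parser = _CellTextExtractor()
    parser.feed(table["html"])
    parser.close()
    lines = [" ".join(cell.strip() for cell in row if cell.strip()) for row in parser.rows]
    text = "\n".join(line for line in lines if line)
    return {"type": "text", "bbox": table["bbox"], "text": text}


table = {
    "type": "table",
    "bbox": (0, 0, 200, 100),
    "html": "<table><tr><td>Name</td><td>Qty</td></tr><tr><td>Bolt</td><td>4</td></tr></table>",
}
print(table_to_text_element(table))
```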

Scenario: Reading order is recalculated

GIVEN tables have been reclassified as TEXT
WHEN assembling the final page structure
THEN the system SHALL recalculate reading order
AND sort elements by y0 then x0 coordinates

Requirement: Validation Configuration

The system SHALL provide configurable thresholds for cell validation.

Scenario: Default thresholds are applied

GIVEN no custom configuration is provided
WHEN validating tables
THEN the system SHALL use default thresholds:
- max_cell_density: 3.0 cells/10000px²
- min_avg_cell_area: 3000 px²
- min_cell_height: 10 px

Scenario: Custom thresholds can be configured

GIVEN custom validation thresholds in configuration
WHEN validating tables
THEN the system SHALL use the custom values
AND apply them consistently to all pages

Requirement: Use PP-StructureV3 Internal OCR Results

The system SHALL preferentially use PP-StructureV3's internal OCR results (overall_ocr_res) instead of running a separate Raw OCR inference.

Scenario: Extract overall_ocr_res from PP-StructureV3

GIVEN PP-StructureV3 processing completes
WHEN the result contains json['res']['overall_ocr_res']
THEN the system SHALL extract OCR regions from:
- dt_polys: detection box polygons
- rec_texts: recognized text strings
- rec_scores: confidence scores
AND convert these to the standard TextRegion format for gap filling
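
Non-normative sketch of the extraction. The field names `dt_polys`, `rec_texts`, and `rec_scores` come from the scenario above; the layout of the TextRegion format is not defined in this specification, so the output dictionary (text, confidence, axis-aligned bbox, polygon) is an assumption.

```python
def extract_text_regions(result_json):
    """Convert json['res']['overall_ocr_res'] into a list of region dicts.

    Returns an empty list when overall_ocr_res is missing.
    """
    ocr = result_json.get("res", {}).get("overall_ocr_res")
    if not ocr:
        return []

    regions = []
    for poly, text, score in zip(ocr["dt_polys"], ocr["rec_texts"], ocr["rec_scores"]):
        xs = [point[0] for point in poly]
        ys = [point[1] for point in poly]
        regions.append({
            "text": text,
            "confidence": float(score),
            # Axis-aligned box enclosing the detection polygon.
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            "polygon": [tuple(point) for point in poly],
        })
    return regions


result = {"res": {"overall_ocr_res": {
    "dt_polys": [[[10, 20], [110, 20], [110, 40], [10, 40]]],
    "rec_texts": ["Invoice"],
    "rec_scores": [0.98],
}}}
print(extract_text_regions(result))
```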

Scenario: Skip separate Raw OCR when overall_ocr_res is available

GIVEN gap_filling_use_overall_ocr is true (default)
WHEN PP-StructureV3 result contains overall_ocr_res
THEN the system SHALL NOT execute separate PaddleOCR inference
AND use the extracted overall_ocr_res as the OCR source
AND this reduces total inference time by approximately 50%

Scenario: Fallback to separate Raw OCR when needed

GIVEN gap_filling_use_overall_ocr is false OR overall_ocr_res is missing
WHEN gap filling is activated
THEN the system SHALL execute separate PaddleOCR inference as before
AND use the separate OCR results for gap filling
AND this maintains backward compatibility
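
Non-normative sketch of the source selection between the two scenarios above. The separate PaddleOCR inference is represented by a caller-supplied callable, `run_raw_ocr`, which is a hypothetical stand-in; the point is that it is invoked only on the fallback path.

```python
def get_ocr_source(result_json, run_raw_ocr, use_overall_ocr=True):
    """Return (source_name, ocr_data), running separate OCR only when needed."""
    overall = result_json.get("res", {}).get("overall_ocr_res")
    if use_overall_ocr and overall:
        return "overall_ocr_res", overall
    # Fallback: the flag is off or PP-StructureV3 returned no overall_ocr_res.
    return "raw_ocr", run_raw_ocr()


calls = []


def fake_raw_ocr():
    """Stand-in for a separate PaddleOCR run; records that it was called."""
    calls.append("run")
    return {"rec_texts": ["from separate inference"]}


with_ocr = {"res": {"overall_ocr_res": {"rec_texts": ["from PP-StructureV3"]}}}
print(get_ocr_source(with_ocr, fake_raw_ocr)[0], len(calls))
print(get_ocr_source({"res": {}}, fake_raw_ocr)[0], len(calls))
```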

Scenario: Coordinate consistency is guaranteed

GIVEN overall_ocr_res is extracted from PP-StructureV3
WHEN comparing with PP-StructureV3 layout elements
THEN both SHALL use the same coordinate system
AND no additional coordinate alignment is needed
AND this prevents scale mismatch issues
