egg/OCR


egg 59206a6ab8 feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-27 13:27:00 +08:00

9.0 KiB


ocr-processing Specification

Purpose

TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.

Requirements

Requirement: OCR Track Gap Filling with Raw OCR Regions

The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

Scenario: Gap filling activates when coverage is low

GIVEN an OCR track processing task
WHEN PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
THEN the system SHALL activate gap filling
AND identify Raw OCR regions not covered by any PP-StructureV3 element
AND supplement these regions as TEXT elements in the output
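
The activation rule above can be sketched as follows. This is a minimal sketch, assuming coverage is measured as the fraction of Raw OCR regions marked as covered (the spec does not say whether the ratio is count-based or area-based). The function name `should_activate_gap_filling` and its parameters are hypothetical.

```python
from typing import Sequence


def should_activate_gap_filling(
    covered_flags: Sequence[bool], coverage_threshold: float = 0.7
) -> bool:
    """Return True when the covered share of Raw OCR regions is below the threshold.

    covered_flags holds one boolean per Raw OCR region (True means the region
    is covered by some PP-StructureV3 element). Using a count-based ratio is
    an assumption; the spec only states the 70% figure.
    """
    if not covered_flags:
        # No Raw OCR regions: nothing could be supplemented anyway.
        return False
    coverage = sum(covered_flags) / len(covered_flags)
    return coverage < coverage_threshold
```

With the default threshold, 6 covered regions out of 10 would trigger gap filling, while 7 out of 10 would not, because the comparison is strict.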

Scenario: Coverage is determined by center-point and IoU

GIVEN a Raw OCR text region with a bounding box
WHEN checking if the region is covered by PP-StructureV3
THEN the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
OR if its IoU with any PP-StructureV3 element exceeds the 0.15 threshold
AND regions not meeting either criterion SHALL be marked as uncovered
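
The two coverage criteria can be illustrated with a short sketch. It assumes bounding boxes are `(x0, y0, x1, y1)` tuples in a shared coordinate space; the helper names `center_inside`, `iou`, and `is_covered` are hypothetical and not taken from the project.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def center_inside(region: BBox, element: BBox) -> bool:
    """True if the center point of region lies inside element (edges included)."""
    cx = (region[0] + region[2]) / 2
    cy = (region[1] + region[3]) / 2
    return element[0] <= cx <= element[2] and element[1] <= cy <= element[3]


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_covered(region: BBox, elements: Iterable[BBox], iou_threshold: float = 0.15) -> bool:
    """A region is covered if either criterion holds for any element."""
    return any(
        center_inside(region, el) or iou(region, el) > iou_threshold
        for el in elements
    )
```

The IoU criterion catches regions whose center falls just outside an element but that still overlap it substantially.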

Scenario: Only TEXT elements are supplemented

GIVEN uncovered Raw OCR regions identified for supplementation
WHEN PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
THEN the system SHALL NOT supplement regions that overlap with these structural elements
AND SHALL supplement regions only as TEXT type to preserve structural integrity
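
A sketch of the structural-overlap filter follows. The spec does not define how much overlap counts, so this example assumes any positive-area intersection disqualifies a region. Representing elements as dicts with `type` and `bbox` keys is also an assumption made for illustration.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout

# Element types listed in the scenario above.
STRUCTURAL_TYPES = {"TABLE", "IMAGE", "FIGURE", "FLOWCHART", "HEADER", "FOOTER"}


def boxes_overlap(a: BBox, b: BBox) -> bool:
    """True if the two boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_structural_overlaps(regions: List[BBox], elements: List[Dict]) -> List[BBox]:
    """Drop uncovered Raw OCR regions that intersect any structural element."""
    structural = [e["bbox"] for e in elements if e["type"] in STRUCTURAL_TYPES]
    return [r for r in regions if not any(boxes_overlap(r, s) for s in structural)]
```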

Scenario: Supplemented regions meet confidence threshold

GIVEN Raw OCR regions to be supplemented
WHEN a region has a confidence score below 0.3
THEN the system SHALL skip that region
AND only supplement regions with confidence >= 0.3

Scenario: Deduplication prevents repeated text

GIVEN a Raw OCR region being considered for supplementation
WHEN the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
THEN the system SHALL skip that region to prevent duplicate text
AND the original PP-StructureV3 element SHALL be preserved
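
The confidence and deduplication rules can be combined into one selection pass, sketched below. Region dicts with `bbox`, `confidence`, and `text` keys, and the function name `select_supplements`, are hypothetical; the thresholds are the defaults stated in the scenarios.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_supplements(
    regions: List[Dict],
    text_element_bboxes: List[BBox],
    min_confidence: float = 0.3,
    dedup_iou: float = 0.5,
) -> List[Dict]:
    """Keep regions that are confident enough and do not duplicate existing TEXT."""
    kept = []
    for region in regions:
        if region["confidence"] < min_confidence:
            continue  # below the confidence threshold
        if any(iou(region["bbox"], t) > dedup_iou for t in text_element_bboxes):
            continue  # near-duplicate of an existing PP-StructureV3 TEXT element
        kept.append(region)
    return kept
```

Existing PP-StructureV3 elements are never modified here; only candidate regions are dropped.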

Scenario: Reading order is recalculated after gap filling

GIVEN supplemented elements have been added to the page
WHEN assembling the final element list
THEN the system SHALL recalculate reading order for the entire page
AND sort elements by y0 coordinate (top to bottom) then x0 (left to right)
AND ensure logical document flow is maintained
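
A literal reading of the sort rule is sketched below. It assumes elements are dicts with a `bbox` of `(x0, y0, x1, y1)` and writes the position into a hypothetical `reading_order` key. A strict y0 sort can misorder elements on the same visual line whose y0 values differ slightly; any row tolerance would be an addition beyond what the spec states.

```python
from typing import Dict, List


def recalculate_reading_order(elements: List[Dict]) -> List[Dict]:
    """Sort by y0 (top to bottom), then x0 (left to right), and renumber.

    Returns new dicts so the input elements are left untouched.
    """
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]
```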

Scenario: Coordinate alignment with ocr_dimensions

GIVEN Raw OCR processing may involve image resizing
WHEN comparing Raw OCR bbox with PP-StructureV3 bbox
THEN the system SHALL use ocr_dimensions to normalize coordinates
AND ensure both sources reference the same coordinate space
AND prevent coverage misdetection due to scale differences
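
One way to bring both sources into the same coordinate space is a per-axis rescale, sketched below. The spec names ocr_dimensions but does not describe its shape, so the `width`/`height` keys and the `target_dimensions` argument are assumptions.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def normalize_bbox(
    bbox: BBox, ocr_dimensions: Dict[str, float], target_dimensions: Dict[str, float]
) -> BBox:
    """Rescale a Raw OCR bbox from the OCR image size to the target image size."""
    sx = target_dimensions["width"] / ocr_dimensions["width"]
    sy = target_dimensions["height"] / ocr_dimensions["height"]
    return (bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy)
```

If Raw OCR ran on an image downscaled by half, each coordinate is doubled before coverage is checked against the PP-StructureV3 boxes.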

Scenario: Supplemented elements have complete metadata

GIVEN a Raw OCR region being added as supplemented element
WHEN creating the DocumentElement
THEN the element SHALL include page_number
AND include confidence score from Raw OCR
AND include original bbox coordinates
AND optionally include source indicator for debugging
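
The required metadata could be carried by a small record type, sketched here. The project's actual DocumentElement class is not shown in this spec, so `SupplementedElement`, its field names, and the `raw_ocr_gap_fill` source label are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SupplementedElement:
    """Illustrative stand-in for a supplemented DocumentElement."""

    page_number: int
    bbox: Tuple[float, float, float, float]  # original Raw OCR coordinates
    confidence: float                        # score reported by Raw OCR
    text: str
    element_type: str = "TEXT"               # supplemented regions are always TEXT
    source: Optional[str] = None             # optional debugging indicator


def from_raw_ocr_region(region: Dict, page_number: int, debug: bool = False) -> SupplementedElement:
    """Build a supplemented element from a Raw OCR region dict (assumed keys)."""
    return SupplementedElement(
        page_number=page_number,
        bbox=tuple(region["bbox"]),
        confidence=region["confidence"],
        text=region["text"],
        source="raw_ocr_gap_fill" if debug else None,
    )
```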

Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.

Scenario: Gap filling only activates for OCR track

GIVEN a document processing task
WHEN the processing track is OCR
THEN the system SHALL evaluate and apply gap filling as needed
AND produce enhanced output with supplemented content

Scenario: Direct track is unaffected

GIVEN a document processing task with Direct track
WHEN the task is processed
THEN the system SHALL NOT invoke any gap filling logic
AND produce output identical to current Direct track behavior

Scenario: Hybrid track is unaffected

GIVEN a document processing task with Hybrid track
WHEN the task is processed
THEN the system SHALL NOT invoke gap filling logic
AND use existing Hybrid track processing pipeline
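
Track isolation amounts to a guard in front of the gap filling step. The sketch below assumes the track is passed as a string such as "ocr", "direct", or "hybrid"; the function and parameter names are hypothetical.

```python
from typing import Callable, Dict, List

Elements = List[Dict]


def process_with_track_isolation(
    track: str, elements: Elements, gap_filler: Callable[[Elements], Elements]
) -> Elements:
    """Invoke gap_filler only for the OCR track; pass other tracks through unchanged."""
    if track.lower() != "ocr":
        # Direct and Hybrid outputs must not be touched by gap filling.
        return elements
    return gap_filler(elements)
```

Because the guard returns the original list for non-OCR tracks, the gap filling code is never called for Direct or Hybrid.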

Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

Scenario: Gap filling can be disabled via configuration

GIVEN gap_filling_enabled is set to false in configuration
WHEN OCR track processing runs
THEN the system SHALL skip all gap filling logic
AND output only PP-StructureV3 results as before

Scenario: Coverage threshold is configurable

GIVEN gap_filling_coverage_threshold is set to 0.8
WHEN PP-StructureV3 coverage is 75%
THEN the system SHALL activate gap filling
AND supplement uncovered regions

Scenario: IoU thresholds are configurable

GIVEN custom IoU thresholds configured:
- gap_filling_iou_threshold: 0.2
- gap_filling_dedup_iou_threshold: 0.6
WHEN evaluating coverage and deduplication
THEN the system SHALL use the configured values
AND apply them consistently throughout the gap filling process

Scenario: Confidence threshold is configurable

GIVEN gap_filling_confidence_threshold is set to 0.5
WHEN supplementing Raw OCR regions
THEN the system SHALL only include regions with confidence >= 0.5
AND filter out lower confidence regions
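
The configuration keys named in these scenarios can be gathered into one settings object, sketched below. The numeric defaults are the values stated earlier in this spec (0.7, 0.15, 0.5, 0.3). Enabling gap filling by default and rejecting values outside [0, 1] are assumptions, as is the class name `GapFillingConfig`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillingConfig:
    gap_filling_enabled: bool = True  # assumed default
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_iou_threshold: float = 0.15
    gap_filling_dedup_iou_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3

    def __post_init__(self) -> None:
        # All thresholds are ratios or scores, so they should lie in [0, 1].
        for name in (
            "gap_filling_coverage_threshold",
            "gap_filling_iou_threshold",
            "gap_filling_dedup_iou_threshold",
            "gap_filling_confidence_threshold",
        ):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")
```

For example, `GapFillingConfig(gap_filling_iou_threshold=0.2, gap_filling_dedup_iou_threshold=0.6)` mirrors the custom IoU scenario above while leaving the other defaults in place.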

Requirement: Layout Model Selection

The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

Scenario: User selects Chinese document model

GIVEN a user is processing Chinese business documents (forms, contracts, invoices)
WHEN the user selects "Chinese Document Model" (PP-DocLayout-S)
THEN the OCR engine SHALL use the PP-DocLayout-S layout detection model
AND the model SHALL be optimized for 23 Chinese document element types
AND table and form detection accuracy SHALL be improved over the default model

Scenario: User selects standard model for English documents

GIVEN a user is processing English academic papers or reports
WHEN the user selects "Standard Model" (PubLayNet-based)
THEN the OCR engine SHALL use the default PubLayNet-based layout detection model
AND the model SHALL be optimized for English document layouts

Scenario: User selects CDLA model for specialized Chinese layout

GIVEN a user is processing Chinese documents with complex layouts
WHEN the user selects "CDLA Model"
THEN the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
AND the model SHALL provide specialized Chinese document layout analysis

Scenario: Layout model is sent via API request

GIVEN a frontend application with model selection UI
WHEN the user starts task processing with a selected model

THEN the frontend SHALL send the model choice in the request body:

POST /api/v2/tasks/{task_id}/start
{
  "use_dual_track": true,
  "force_track": "ocr",
  "language": "ch",
  "layout_model": "chinese"
}

AND the backend SHALL configure PP-StructureV3 with the corresponding model

Scenario: Default model when not specified

GIVEN an API request without a layout_model parameter
WHEN the task is started
THEN the system SHALL use "chinese" (PP-DocLayout-S) as the default model
AND processing SHALL work correctly without requiring model selection

Scenario: Invalid model name is rejected

GIVEN a request with an invalid layout_model value
WHEN the user sends layout_model: "invalid_model"
THEN the API SHALL return 422 Validation Error
AND provide a clear error message listing valid model options
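
The default and validation behavior can be sketched with the standard library alone. This is not the project's actual request handling; `parse_layout_model` and `LayoutModelValidationError` are hypothetical names, and the `status_code` attribute merely stands in for whatever mechanism the API framework uses to produce a 422 response.

```python
from typing import Dict

VALID_LAYOUT_MODELS = ("chinese", "default", "cdla")
DEFAULT_LAYOUT_MODEL = "chinese"  # PP-DocLayout-S, per the default scenario


class LayoutModelValidationError(ValueError):
    """Raised for an unsupported layout_model; maps to HTTP 422."""

    status_code = 422


def parse_layout_model(body: Dict) -> str:
    """Return the requested layout model, falling back to the default when absent."""
    value = body.get("layout_model", DEFAULT_LAYOUT_MODEL)
    if value not in VALID_LAYOUT_MODELS:
        options = ", ".join(VALID_LAYOUT_MODELS)
        raise LayoutModelValidationError(
            f"Invalid layout_model {value!r}; valid options are: {options}"
        )
    return value
```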

Requirement: Layout Model Selection UI

The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

Scenario: Model options are displayed with descriptions

GIVEN the model selection UI is displayed
WHEN the user views the available options
THEN the UI SHALL show the following options:
- "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
- "Standard Model" - for English academic papers, reports
- "CDLA Model" - for specialized Chinese layout analysis
AND each option SHALL have a brief description of its use case

Scenario: Chinese model is selected by default

GIVEN the user opens the task processing interface
WHEN the model selection is displayed
THEN "Chinese Document Model" SHALL be pre-selected as the default
AND the user MAY change the selection before starting processing

Scenario: Model selection is visible only for OCR track

GIVEN a document processing interface
WHEN the user selects a processing track
THEN layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
AND SHALL be hidden for Direct track (which does not use PP-StructureV3)
