# ocr-processing Specification

## Purpose

TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.

## Requirements

### Requirement: OCR Track Gap Filling with Raw OCR Regions

The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

#### Scenario: Gap filling activates when coverage is low

- **GIVEN** an OCR track processing task
- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- **THEN** the system SHALL activate gap filling
- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
- **AND** supplement these regions as TEXT elements in the output

#### Scenario: Coverage is determined by center-point and IoU

- **GIVEN** a Raw OCR text region with a bounding box
- **WHEN** checking if the region is covered by PP-StructureV3
- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
- **OR** if its IoU with any PP-StructureV3 element exceeds the 0.15 threshold
- **AND** regions not meeting either criterion SHALL be marked as uncovered

#### Scenario: Only TEXT elements are supplemented

- **GIVEN** uncovered Raw OCR regions identified for supplementation
- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
- **AND** only supplement regions as TEXT type to preserve structural integrity

#### Scenario: Supplemented regions meet confidence threshold

- **GIVEN** Raw OCR regions to be supplemented
- **WHEN** a region has a confidence score below 0.3
- **THEN** the system SHALL skip that region
- **AND** only supplement regions with confidence >= 0.3

#### Scenario: Deduplication prevents repeated text

- **GIVEN** a Raw OCR region being considered for supplementation
- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
- **THEN** the system SHALL skip that region to prevent duplicate text
- **AND** the original PP-StructureV3 element SHALL be preserved

#### Scenario: Reading order is recalculated after gap filling

- **GIVEN** supplemented elements have been added to the page
- **WHEN** assembling the final element list
- **THEN** the system SHALL recalculate reading order for the entire page
- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- **AND** ensure logical document flow is maintained

#### Scenario: Coordinate alignment with ocr_dimensions

- **GIVEN** Raw OCR processing may involve image resizing
- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
- **AND** ensure both sources reference the same coordinate space
- **AND** prevent coverage misdetection due to scale differences

#### Scenario: Supplemented elements have complete metadata

- **GIVEN** a Raw OCR region being added as a supplemented element
- **WHEN** creating the DocumentElement
- **THEN** the element SHALL include page_number
- **AND** include the confidence score from Raw OCR
- **AND** include the original bbox coordinates
- **AND** optionally include a source indicator for debugging

### Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
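The isolation rule amounts to a single gate in front of the gap-filling step. The following is a minimal sketch, not the project's implementation: the `Track` enum and the `should_fill_gaps` helper are hypothetical names, the `"direct"` and `"hybrid"` string values are assumptions, and the `enabled` argument stands in for the `gap_filling_enabled` setting.

```python
from enum import Enum


class Track(Enum):
    """Processing tracks named in this spec (the enum itself is hypothetical)."""

    OCR = "ocr"
    DIRECT = "direct"  # assumed string value
    HYBRID = "hybrid"  # assumed string value


def should_fill_gaps(track: Track, enabled: bool = True) -> bool:
    """Return True only when gap filling is enabled and the track is OCR."""
    return enabled and track is Track.OCR
```

Keeping the check in one place means the Direct and Hybrid paths never reach any gap-filling code at all, rather than reaching it and doing nothing.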
#### Scenario: Gap filling only activates for OCR track

- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content

#### Scenario: Direct track is unaffected

- **GIVEN** a document processing task with Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior

#### Scenario: Hybrid track is unaffected

- **GIVEN** a document processing task with Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use existing Hybrid track processing pipeline

### Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

#### Scenario: Gap filling can be disabled via configuration

- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before

#### Scenario: Coverage threshold is configurable

- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75%
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions

#### Scenario: IoU thresholds are configurable

- **GIVEN** custom IoU thresholds configured:
  - gap_filling_iou_threshold: 0.2
  - gap_filling_dedup_iou_threshold: 0.6
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout the gap filling process

#### Scenario: Confidence threshold is configurable

- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions
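To make the interaction between these thresholds concrete, here is a minimal sketch of the coverage test, activation check, confidence filter, and reading-order sort. It is an illustration under assumptions, not the project's code: `GapFillingConfig`, `iou`, `is_covered`, and `select_supplements` are hypothetical names, bounding boxes are assumed to be `(x0, y0, x1, y1)` tuples already normalized to one coordinate space, and deduplication and the structural-element exclusion are left out for brevity.

```python
from dataclasses import dataclass

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


@dataclass(frozen=True)
class GapFillingConfig:
    """Hypothetical holder; fields mirror the gap_filling_* keys and defaults."""

    enabled: bool = True
    coverage_threshold: float = 0.7
    iou_threshold: float = 0.15
    confidence_threshold: float = 0.3


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_covered(region: BBox, elements: list[BBox], cfg: GapFillingConfig) -> bool:
    """Covered if the center lies in an element or IoU exceeds the threshold."""
    cx = (region[0] + region[2]) / 2
    cy = (region[1] + region[3]) / 2
    for e in elements:
        if e[0] <= cx <= e[2] and e[1] <= cy <= e[3]:
            return True
        if iou(region, e) > cfg.iou_threshold:
            return True
    return False


def select_supplements(
    raw: list[tuple[BBox, float]],
    elements: list[BBox],
    cfg: GapFillingConfig,
) -> list[tuple[BBox, float]]:
    """Return the Raw OCR (bbox, confidence) pairs to add as TEXT elements."""
    if not cfg.enabled or not raw:
        return []
    uncovered = [(b, c) for b, c in raw if not is_covered(b, elements, cfg)]
    coverage = 1 - len(uncovered) / len(raw)
    if coverage >= cfg.coverage_threshold:
        return []  # coverage is high enough, so gap filling stays inactive
    kept = [(b, c) for b, c in uncovered if c >= cfg.confidence_threshold]
    # Reading order: top to bottom (y0), then left to right (x0).
    return sorted(kept, key=lambda r: (r[0][1], r[0][0]))
```

With the default 0.7 threshold, a page at 75% coverage is left alone; passing `GapFillingConfig(coverage_threshold=0.8)` makes the same page eligible for gap filling, which is the behavior the coverage-threshold scenario describes.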
### Requirement: Layout Model Selection

The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

#### Scenario: User selects Chinese document model

- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
- **AND** the model SHALL be optimized for 23 Chinese document element types
- **AND** table and form detection accuracy SHALL be improved over the default model

#### Scenario: User selects standard model for English documents

- **GIVEN** a user is processing English academic papers or reports
- **WHEN** the user selects "Standard Model" (PubLayNet-based)
- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
- **AND** the model SHALL be optimized for English document layouts

#### Scenario: User selects CDLA model for specialized Chinese layout

- **GIVEN** a user is processing Chinese documents with complex layouts
- **WHEN** the user selects "CDLA Model"
- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- **AND** the model SHALL provide specialized Chinese document layout analysis

#### Scenario: Layout model is sent via API request

- **GIVEN** a frontend application with model selection UI
- **WHEN** the user starts task processing with a selected model
- **THEN** the frontend SHALL send the model choice in the request body:
  ```json
  POST /api/v2/tasks/{task_id}/start
  {
    "use_dual_track": true,
    "force_track": "ocr",
    "language": "ch",
    "layout_model": "chinese"
  }
  ```
- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model

#### Scenario: Default model when not specified

- **GIVEN** an API request without `layout_model` parameter
- **WHEN** the task is started
- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- **AND** processing SHALL work correctly without requiring model selection

#### Scenario: Invalid model name is rejected

- **GIVEN** a request with an invalid `layout_model` value
- **WHEN** the user sends `layout_model: "invalid_model"`
- **THEN** the API SHALL return 422 Validation Error
- **AND** provide a clear error message listing valid model options

### Requirement: Layout Model Selection UI

The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

#### Scenario: Model options are displayed with descriptions

- **GIVEN** the model selection UI is displayed
- **WHEN** the user views the available options
- **THEN** the UI SHALL show the following options:
  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
  - "Standard Model" - for English academic papers, reports
  - "CDLA Model" - for specialized Chinese layout analysis
- **AND** each option SHALL have a brief description of its use case

#### Scenario: Chinese model is selected by default

- **GIVEN** the user opens the task processing interface
- **WHEN** the model selection is displayed
- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
- **AND** the user MAY change the selection before starting processing

#### Scenario: Model selection is visible only for OCR track

- **GIVEN** a document processing interface
- **WHEN** the user selects processing track
- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)

### Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.
#### Scenario: User wants to free disk space after model upgrade

- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/`
- **AND** list which model directories can be safely removed
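Before deleting anything, it helps to see how much space each cached model occupies. The sketch below is read-only and only reports sizes; `cached_model_sizes` is a hypothetical helper, and it assumes each model is stored in its own subdirectory of the cache. Which directories are safe to remove is left to the documentation this requirement calls for.

```python
from pathlib import Path


def cached_model_sizes(cache_dir: Path) -> dict[str, int]:
    """Map each model directory under cache_dir to its total size in bytes."""
    if not cache_dir.is_dir():
        return {}
    sizes: dict[str, int] = {}
    for model_dir in sorted(cache_dir.iterdir()):
        if model_dir.is_dir():
            # Sum every regular file below the model directory, recursively.
            sizes[model_dir.name] = sum(
                f.stat().st_size for f in model_dir.rglob("*") if f.is_file()
            )
    return sizes
```

Calling `cached_model_sizes(Path.home() / ".paddlex" / "official_models")` returns a dictionary keyed by directory name, or an empty dictionary if the cache directory does not exist.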