feat: simplify layout model selection and archive proposals

Changes: - Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector - Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla - Add LayoutModelSelector component and zh-TW translations - Fix "default" model behavior with sentinel value for PubLayNet - Add gap filling service for OCR track coverage improvement - Add PP-Structure debug utilities - Archive completed/incomplete proposals: - add-ocr-track-gap-filling (complete) - fix-ocr-track-table-rendering (incomplete) - simplify-ppstructure-model-selection (22/25 tasks) - Add new layout model tests, archive old PP-Structure param tests - Update OpenSpec ocr-processing spec with layout model requirements 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions
--- a/openspec/specs/ocr-processing/spec.md
+++ b/openspec/specs/ocr-processing/spec.md
@@ -3,100 +3,186 @@
 ## Purpose
 TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.
 ## Requirements
-### Requirement: Frontend-Adjustable PP-StructureV3 Parameters
-The system SHALL allow frontend users to dynamically adjust PP-StructureV3 OCR parameters for fine-tuning document processing without backend configuration changes.
+### Requirement: OCR Track Gap Filling with Raw OCR Regions

-#### Scenario: User adjusts layout detection threshold
- **GIVEN** a user is processing a document with OCR track
- **WHEN** the user sets `layout_detection_threshold` to 0.1 (lower than default 0.2)
- **THEN** the OCR engine SHALL detect more layout blocks including weak signals
- **AND** the processing SHALL use the custom parameter instead of backend defaults
- **AND** the custom parameter SHALL NOT be cached for reuse
+The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

-#### Scenario: User selects high-quality preset configuration
- **GIVEN** a user wants to process a complex document with many small text elements
- **WHEN** the user selects "High Quality" preset mode
- **THEN** the system SHALL automatically set:
-  - `layout_detection_threshold` to 0.1
-  - `layout_nms_threshold` to 0.15
-  - `text_det_thresh` to 0.1
-  - `text_det_box_thresh` to 0.2
- **AND** process the document with these optimized parameters
+#### Scenario: Gap filling activates when coverage is low
+- **GIVEN** an OCR track processing task
+- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
+- **THEN** the system SHALL activate gap filling
+- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
+- **AND** supplement these regions as TEXT elements in the output

-#### Scenario: User adjusts text detection parameters
- **GIVEN** a document with low-contrast text
- **WHEN** the user sets:
-  - `text_det_thresh` to 0.05 (very low)
-  - `text_det_unclip_ratio` to 1.5 (larger boxes)
- **THEN** the OCR SHALL detect more small and low-contrast text
- **AND** text bounding boxes SHALL be expanded by the specified ratio
+#### Scenario: Coverage is determined by center-point and IoU
+- **GIVEN** a Raw OCR text region with bounding box
+- **WHEN** checking if the region is covered by PP-StructureV3
+- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
+- **OR** if IoU with any PP-StructureV3 element exceeds 0.15 threshold
+- **AND** regions not meeting either criterion SHALL be marked as uncovered

-#### Scenario: Parameters are sent via API request body
- **GIVEN** a frontend application with parameter adjustment UI
- **WHEN** the user starts task processing with custom parameters
- **THEN** the frontend SHALL send parameters in the request body (not query params):
+#### Scenario: Only TEXT elements are supplemented
+- **GIVEN** uncovered Raw OCR regions identified for supplementation
+- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
+- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
+- **AND** only supplement regions as TEXT type to preserve structural integrity
+
+#### Scenario: Supplemented regions meet confidence threshold
+- **GIVEN** Raw OCR regions to be supplemented
+- **WHEN** a region has confidence score below 0.3
+- **THEN** the system SHALL skip that region
+- **AND** only supplement regions with confidence >= 0.3
+
+#### Scenario: Deduplication prevents repeated text
+- **GIVEN** a Raw OCR region being considered for supplementation
+- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
+- **THEN** the system SHALL skip that region to prevent duplicate text
+- **AND** the original PP-StructureV3 element SHALL be preserved
+
+#### Scenario: Reading order is recalculated after gap filling
+- **GIVEN** supplemented elements have been added to the page
+- **WHEN** assembling the final element list
+- **THEN** the system SHALL recalculate reading order for the entire page
+- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
+- **AND** ensure logical document flow is maintained
+
+#### Scenario: Coordinate alignment with ocr_dimensions
+- **GIVEN** Raw OCR processing may involve image resizing
+- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
+- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
+- **AND** ensure both sources reference the same coordinate space
+- **AND** prevent coverage misdetection due to scale differences
+
+#### Scenario: Supplemented elements have complete metadata
+- **GIVEN** a Raw OCR region being added as supplemented element
+- **WHEN** creating the DocumentElement
+- **THEN** the element SHALL include page_number
+- **AND** include confidence score from Raw OCR
+- **AND** include original bbox coordinates
+- **AND** optionally include source indicator for debugging
+
+### Requirement: Gap Filling Track Isolation
+
+The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
+
+#### Scenario: Gap filling only activates for OCR track
+- **GIVEN** a document processing task
+- **WHEN** the processing track is OCR
+- **THEN** the system SHALL evaluate and apply gap filling as needed
+- **AND** produce enhanced output with supplemented content
+
+#### Scenario: Direct track is unaffected
+- **GIVEN** a document processing task with Direct track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke any gap filling logic
+- **AND** produce output identical to current Direct track behavior
+
+#### Scenario: Hybrid track is unaffected
+- **GIVEN** a document processing task with Hybrid track
+- **WHEN** the task is processed
+- **THEN** the system SHALL NOT invoke gap filling logic
+- **AND** use existing Hybrid track processing pipeline
+
+### Requirement: Gap Filling Configuration
+
+The system SHALL provide configurable parameters for gap filling behavior.
+
+#### Scenario: Gap filling can be disabled via configuration
+- **GIVEN** gap_filling_enabled is set to false in configuration
+- **WHEN** OCR track processing runs
+- **THEN** the system SHALL skip all gap filling logic
+- **AND** output only PP-StructureV3 results as before
+
+#### Scenario: Coverage threshold is configurable
+- **GIVEN** gap_filling_coverage_threshold is set to 0.8
+- **WHEN** PP-StructureV3 coverage is 75%
+- **THEN** the system SHALL activate gap filling
+- **AND** supplement uncovered regions
+
+#### Scenario: IoU thresholds are configurable
+- **GIVEN** custom IoU thresholds configured:
+  - gap_filling_iou_threshold: 0.2
+  - gap_filling_dedup_iou_threshold: 0.6
+- **WHEN** evaluating coverage and deduplication
+- **THEN** the system SHALL use the configured values
+- **AND** apply them consistently throughout gap filling process
+
+#### Scenario: Confidence threshold is configurable
+- **GIVEN** gap_filling_confidence_threshold is set to 0.5
+- **WHEN** supplementing Raw OCR regions
+- **THEN** the system SHALL only include regions with confidence >= 0.5
+- **AND** filter out lower confidence regions
+
+### Requirement: Layout Model Selection
+The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.
+
+#### Scenario: User selects Chinese document model
+- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
+- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
+- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
+- **AND** the model SHALL be optimized for 23 Chinese document element types
+- **AND** table and form detection accuracy SHALL be improved over the default model
+
+#### Scenario: User selects standard model for English documents
+- **GIVEN** a user is processing English academic papers or reports
+- **WHEN** the user selects "Standard Model" (PubLayNet-based)
+- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
+- **AND** the model SHALL be optimized for English document layouts
+
+#### Scenario: User selects CDLA model for specialized Chinese layout
+- **GIVEN** a user is processing Chinese documents with complex layouts
+- **WHEN** the user selects "CDLA Model"
+- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
+- **AND** the model SHALL provide specialized Chinese document layout analysis
+
+#### Scenario: Layout model is sent via API request
+- **GIVEN** a frontend application with model selection UI
+- **WHEN** the user starts task processing with a selected model
+- **THEN** the frontend SHALL send the model choice in the request body:
  ```json
  POST /api/v2/tasks/{task_id}/start
  {
    "use_dual_track": true,
    "force_track": "ocr",
    "language": "ch",
-    "pp_structure_params": {
-      "layout_detection_threshold": 0.15,
-      "layout_merge_bboxes_mode": "small",
-      "text_det_thresh": 0.1
-    }
+    "layout_model": "chinese"
  }
  ```
- **AND** the backend SHALL parse and apply these parameters
+- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model

-#### Scenario: Backward compatibility is maintained
- **GIVEN** existing API clients without PP-StructureV3 parameter support
- **WHEN** a task is started without `pp_structure_params`
- **THEN** the system SHALL use backend default settings
- **AND** processing SHALL work exactly as before
- **AND** no errors SHALL occur
+#### Scenario: Default model when not specified
+- **GIVEN** an API request without `layout_model` parameter
+- **WHEN** the task is started
+- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
+- **AND** processing SHALL work correctly without requiring model selection

-#### Scenario: Invalid parameters are rejected
- **GIVEN** a request with invalid parameter values
- **WHEN** the user sends:
-  - `layout_detection_threshold` = 1.5 (exceeds max 1.0)
-  - `layout_merge_bboxes_mode` = "invalid" (not in allowed values)
+#### Scenario: Invalid model name is rejected
+- **GIVEN** a request with an invalid `layout_model` value
+- **WHEN** the user sends `layout_model: "invalid_model"`
 - **THEN** the API SHALL return 422 Validation Error
- **AND** provide clear error messages about invalid parameters
+- **AND** provide a clear error message listing valid model options

-#### Scenario: Custom parameters affect only current processing
- **GIVEN** multiple concurrent OCR processing tasks
- **WHEN** Task A uses custom parameters and Task B uses defaults
- **THEN** Task A SHALL process with its custom parameters
- **AND** Task B SHALL process with default parameters
- **AND** no parameter interference SHALL occur between tasks
+### Requirement: Layout Model Selection UI
+The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

-### Requirement: PP-StructureV3 Parameter UI Controls
-The frontend SHALL provide intuitive UI controls for adjusting PP-StructureV3 parameters with appropriate constraints and help text.
+#### Scenario: Model options are displayed with descriptions
+- **GIVEN** the model selection UI is displayed
+- **WHEN** the user views the available options
+- **THEN** the UI SHALL show the following options:
+  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
+  - "Standard Model" - for English academic papers, reports
+  - "CDLA Model" - for specialized Chinese layout analysis
+- **AND** each option SHALL have a brief description of its use case

-#### Scenario: Slider controls for numeric parameters
- **GIVEN** the parameter adjustment UI is displayed
- **WHEN** the user adjusts a numeric parameter slider
- **THEN** the slider SHALL enforce min/max constraints:
-  - Threshold parameters: 0.0 to 1.0
-  - Ratio parameters: > 0 (typically 0.5 to 3.0)
- **AND** display current value in real-time
- **AND** show help text explaining the parameter effect
+#### Scenario: Chinese model is selected by default
+- **GIVEN** the user opens the task processing interface
+- **WHEN** the model selection is displayed
+- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
+- **AND** the user MAY change the selection before starting processing

-#### Scenario: Dropdown for merge mode selection
- **GIVEN** the layout merge mode parameter
- **WHEN** the user clicks the dropdown
- **THEN** the UI SHALL show exactly three options:
-  - "small" (conservative merging)
-  - "large" (aggressive merging)
-  - "union" (middle ground)
- **AND** display description for each option
-
-#### Scenario: Parameters shown only for OCR track
+#### Scenario: Model selection is visible only for OCR track
 - **GIVEN** a document processing interface
 - **WHEN** the user selects processing track
- **THEN** PP-StructureV3 parameters SHALL be shown ONLY when OCR track is selected
- **AND** SHALL be hidden for Direct track
- **AND** SHALL be disabled for Auto track until track is determined
+- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
+- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)