feat: simplify layout model selection and archive proposals
Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -3,100 +3,186 @@
## Purpose
TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.
## Requirements
### Requirement: Frontend-Adjustable PP-StructureV3 Parameters
The system SHALL allow frontend users to dynamically adjust PP-StructureV3 OCR parameters for fine-tuning document processing without backend configuration changes.
### Requirement: OCR Track Gap Filling with Raw OCR Regions

#### Scenario: User adjusts layout detection threshold
- **GIVEN** a user is processing a document with OCR track
- **WHEN** the user sets `layout_detection_threshold` to 0.1 (lower than default 0.2)
- **THEN** the OCR engine SHALL detect more layout blocks including weak signals
- **AND** the processing SHALL use the custom parameter instead of backend defaults
- **AND** the custom parameter SHALL NOT be cached for reuse
The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

#### Scenario: User selects high-quality preset configuration
- **GIVEN** a user wants to process a complex document with many small text elements
- **WHEN** the user selects "High Quality" preset mode
- **THEN** the system SHALL automatically set:
  - `layout_detection_threshold` to 0.1
  - `layout_nms_threshold` to 0.15
  - `text_det_thresh` to 0.1
  - `text_det_box_thresh` to 0.2
- **AND** process the document with these optimized parameters
#### Scenario: Gap filling activates when coverage is low
- **GIVEN** an OCR track processing task
- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- **THEN** the system SHALL activate gap filling
- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
- **AND** supplement these regions as TEXT elements in the output

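The activation rule in the scenario above can be sketched as a simple ratio check. This is a minimal illustration, not the project's gap filling service: the function names and the list of per-region "covered" flags are hypothetical, and only the 70% threshold comes from the scenario.

```python
from typing import Sequence

# Threshold taken from the scenario; the constant name is hypothetical.
COVERAGE_THRESHOLD = 0.7


def coverage_ratio(covered_flags: Sequence[bool]) -> float:
    """Fraction of Raw OCR regions covered by some PP-StructureV3 element."""
    if not covered_flags:
        # Assumption: with no Raw OCR regions there is nothing to fill.
        return 1.0
    return sum(covered_flags) / len(covered_flags)


def should_activate_gap_filling(
    covered_flags: Sequence[bool], threshold: float = COVERAGE_THRESHOLD
) -> bool:
    """Gap filling activates only when coverage falls below the threshold."""
    return coverage_ratio(covered_flags) < threshold
```

For example, a page where 6 of 10 Raw OCR regions are covered (60%) would activate gap filling, while 8 of 10 (80%) would not.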
#### Scenario: User adjusts text detection parameters
- **GIVEN** a document with low-contrast text
- **WHEN** the user sets:
  - `text_det_thresh` to 0.05 (very low)
  - `text_det_unclip_ratio` to 1.5 (larger boxes)
- **THEN** the OCR SHALL detect more small and low-contrast text
- **AND** text bounding boxes SHALL be expanded by the specified ratio
#### Scenario: Coverage is determined by center-point and IoU
- **GIVEN** a Raw OCR text region with bounding box
- **WHEN** checking if the region is covered by PP-StructureV3
- **THEN** the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
- **OR** if IoU with any PP-StructureV3 element exceeds 0.15 threshold
- **AND** regions not meeting either criterion SHALL be marked as uncovered

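The two coverage criteria above (center point inside an element, or IoU above 0.15) can be sketched as follows. This assumes bounding boxes are `(x0, y0, x1, y1)` tuples in a shared coordinate space; the function names are hypothetical and not taken from the project.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_inside(region: BBox, element: BBox) -> bool:
    """True if the center point of `region` lies within `element`."""
    cx = (region[0] + region[2]) / 2
    cy = (region[1] + region[3]) / 2
    return element[0] <= cx <= element[2] and element[1] <= cy <= element[3]


def is_covered(region: BBox, elements: Iterable[BBox], iou_threshold: float = 0.15) -> bool:
    """A region is covered if either criterion holds for any element."""
    return any(
        center_inside(region, e) or iou(region, e) > iou_threshold for e in elements
    )
```

The center-point test catches small text boxes sitting inside a large element, where IoU would be tiny; the IoU test catches boxes that straddle an element's edge.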
#### Scenario: Parameters are sent via API request body
- **GIVEN** a frontend application with parameter adjustment UI
- **WHEN** the user starts task processing with custom parameters
- **THEN** the frontend SHALL send parameters in the request body (not query params):
#### Scenario: Only TEXT elements are supplemented
- **GIVEN** uncovered Raw OCR regions identified for supplementation
- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
- **AND** only supplement regions as TEXT type to preserve structural integrity

#### Scenario: Supplemented regions meet confidence threshold
- **GIVEN** Raw OCR regions to be supplemented
- **WHEN** a region has confidence score below 0.3
- **THEN** the system SHALL skip that region
- **AND** only supplement regions with confidence >= 0.3

#### Scenario: Deduplication prevents repeated text
- **GIVEN** a Raw OCR region being considered for supplementation
- **WHEN** the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
- **THEN** the system SHALL skip that region to prevent duplicate text
- **AND** the original PP-StructureV3 element SHALL be preserved

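The three filtering scenarios above (structural overlap, confidence, deduplication) can be combined into one pass over the uncovered regions. The sketch below is illustrative only: `Element`, `RawRegion`, and `select_supplements` are hypothetical names, and "overlap" with a structural element is assumed to mean any positive intersection area, which the spec does not define.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Structural types named in the scenario above.
STRUCTURAL_TYPES = frozenset({"TABLE", "IMAGE", "FIGURE", "FLOWCHART", "HEADER", "FOOTER"})


@dataclass
class Element:  # hypothetical stand-in for a PP-StructureV3 element
    type: str
    bbox: BBox


@dataclass
class RawRegion:  # hypothetical stand-in for a Raw OCR text region
    text: str
    bbox: BBox
    confidence: float


def intersection_area(a: BBox, b: BBox) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def iou(a: BBox, b: BBox) -> float:
    inter = intersection_area(a, b)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_supplements(
    regions: Sequence[RawRegion],
    elements: Sequence[Element],
    min_confidence: float = 0.3,
    dedup_iou: float = 0.5,
) -> List[RawRegion]:
    """Keep only regions that pass all three filters."""
    kept = []
    for region in regions:
        if region.confidence < min_confidence:
            continue  # below the confidence threshold
        if any(
            e.type in STRUCTURAL_TYPES and intersection_area(region.bbox, e.bbox) > 0
            for e in elements
        ):
            continue  # assumption: any intersection counts as overlap
        if any(e.type == "TEXT" and iou(region.bbox, e.bbox) > dedup_iou for e in elements):
            continue  # duplicate of an existing TEXT element, which is kept as-is
        kept.append(region)
    return kept
```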
#### Scenario: Reading order is recalculated after gap filling
- **GIVEN** supplemented elements have been added to the page
- **WHEN** assembling the final element list
- **THEN** the system SHALL recalculate reading order for the entire page
- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- **AND** ensure logical document flow is maintained

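The sort described above amounts to ordering by the `(y0, x0)` pair. A minimal sketch, assuming elements are dictionaries with a `bbox` of `(x0, y0, x1, y1)`; the `reading_order` key and the function name are hypothetical:

```python
from typing import Any, Dict, List


def recalculate_reading_order(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by y0 (top to bottom), then x0 (left to right), and renumber."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Return copies so the caller's element dictionaries are not mutated.
    return [{**e, "reading_order": index} for index, e in enumerate(ordered)]
```

A strict `y0` sort treats two boxes on the same visual line as different rows if their tops differ by even a pixel. Whether the project groups boxes into lines with a tolerance is not stated in the spec, so this sketch does not attempt it.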
#### Scenario: Coordinate alignment with ocr_dimensions
- **GIVEN** Raw OCR processing may involve image resizing
- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
- **AND** ensure both sources reference the same coordinate space
- **AND** prevent coverage misdetection due to scale differences

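One way to bring both sources into the same coordinate space is to rescale Raw OCR boxes by the ratio of the two image sizes. The sketch below assumes `ocr_dimensions` can be reduced to a `(width, height)` pair for the image Raw OCR actually saw; the spec does not describe its structure, and `scale_bbox` is a hypothetical helper.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
Size = Tuple[float, float]  # (width, height)


def scale_bbox(bbox: BBox, src_size: Size, dst_size: Size) -> BBox:
    """Map a box from the source image space into the destination space."""
    if src_size[0] <= 0 or src_size[1] <= 0:
        raise ValueError("source dimensions must be positive")
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For instance, a box detected on a 1000x2000 resized image maps to doubled coordinates on the 2000x4000 original, after which center-point and IoU comparisons are meaningful.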
#### Scenario: Supplemented elements have complete metadata
- **GIVEN** a Raw OCR region being added as supplemented element
- **WHEN** creating the DocumentElement
- **THEN** the element SHALL include page_number
- **AND** include confidence score from Raw OCR
- **AND** include original bbox coordinates
- **AND** optionally include source indicator for debugging

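The metadata listed above could be carried by a small record type. The project's actual `DocumentElement` is not shown in this diff, so the class below is a hypothetical stand-in with only the fields the scenario names; the `source` value is likewise an invented label.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class SupplementedElement:  # hypothetical stand-in for DocumentElement
    page_number: int
    text: str
    bbox: Tuple[float, ...]  # original Raw OCR coordinates (x0, y0, x1, y1)
    confidence: float  # confidence score reported by Raw OCR
    type: str = "TEXT"  # supplemented regions are always TEXT
    source: Optional[str] = None  # optional debugging indicator


def from_raw_region(
    page_number: int,
    text: str,
    bbox: Sequence[float],
    confidence: float,
    debug: bool = False,
) -> SupplementedElement:
    """Build a supplemented element, tagging its origin only when debugging."""
    return SupplementedElement(
        page_number=page_number,
        text=text,
        bbox=tuple(bbox),
        confidence=confidence,
        source="raw_ocr_gap_fill" if debug else None,
    )
```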
### Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.

#### Scenario: Gap filling only activates for OCR track
- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content

#### Scenario: Direct track is unaffected
- **GIVEN** a document processing task with Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior

#### Scenario: Hybrid track is unaffected
- **GIVEN** a document processing task with Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use existing Hybrid track processing pipeline

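Track isolation reduces to a single guard evaluated before any gap filling code runs. In this sketch the `ProcessingTrack` enum and its string values are assumptions (only `"ocr"` appears elsewhere in this spec, as the `force_track` value), and `gap_filling_applies` is a hypothetical name.

```python
from enum import Enum


class ProcessingTrack(str, Enum):
    OCR = "ocr"
    DIRECT = "direct"  # assumed value
    HYBRID = "hybrid"  # assumed value


def gap_filling_applies(track: ProcessingTrack, enabled: bool = True) -> bool:
    """Gap filling runs only on the OCR track, and only when enabled."""
    return enabled and track is ProcessingTrack.OCR
```

Keeping the check in one place means the Direct and Hybrid pipelines never need to know that gap filling exists.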
### Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

#### Scenario: Gap filling can be disabled via configuration
- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before

#### Scenario: Coverage threshold is configurable
- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75%
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions

#### Scenario: IoU thresholds are configurable
- **GIVEN** custom IoU thresholds configured:
  - gap_filling_iou_threshold: 0.2
  - gap_filling_dedup_iou_threshold: 0.6
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout gap filling process

#### Scenario: Confidence threshold is configurable
- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions

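The configuration keys named in these scenarios can be gathered into one settings object. The field names below come from the spec; the default values are the ones the earlier scenarios use (70% coverage, 0.15 IoU, 0.5 dedup IoU, 0.3 confidence), while `gap_filling_enabled = True` and the range validation are assumptions. The class itself is a hypothetical sketch, not the project's settings module.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GapFillingConfig:
    gap_filling_enabled: bool = True  # assumed default
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_iou_threshold: float = 0.15
    gap_filling_dedup_iou_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3

    def __post_init__(self) -> None:
        # Assumption: every numeric threshold is a ratio in [0, 1].
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, bool):
                continue
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be between 0 and 1, got {value}")
```

With `GapFillingConfig(gap_filling_coverage_threshold=0.8)`, a page at 75% coverage falls below the threshold and gap filling activates, matching the scenario above.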
### Requirement: Layout Model Selection
The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

#### Scenario: User selects Chinese document model
- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
- **AND** the model SHALL be optimized for 23 Chinese document element types
- **AND** table and form detection accuracy SHALL be improved over the default model

#### Scenario: User selects standard model for English documents
- **GIVEN** a user is processing English academic papers or reports
- **WHEN** the user selects "Standard Model" (PubLayNet-based)
- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
- **AND** the model SHALL be optimized for English document layouts

#### Scenario: User selects CDLA model for specialized Chinese layout
- **GIVEN** a user is processing Chinese documents with complex layouts
- **WHEN** the user selects "CDLA Model"
- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- **AND** the model SHALL provide specialized Chinese document layout analysis

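The three choices above map onto model names roughly as sketched below. The keys (`chinese`, `default`, `cdla`) and the model names come from the commit message and the scenarios. Using `None` as the sentinel for the PubLayNet-based default, and passing the name through a `layout_detection_model_name` keyword, are assumptions about how the backend wires this up; the diff does not show that code.

```python
from typing import Dict, Optional

# None is an assumed sentinel meaning "pass no model name and let the
# engine fall back to its built-in PubLayNet-based layout model".
LAYOUT_MODEL_MAPPING: Dict[str, Optional[str]] = {
    "chinese": "PP-DocLayout-S",
    "default": None,
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}


def layout_engine_kwargs(choice: str = "chinese") -> Dict[str, str]:
    """Translate a user-facing choice into engine keyword arguments."""
    if choice not in LAYOUT_MODEL_MAPPING:
        raise ValueError(f"unknown layout model: {choice!r}")
    model_name = LAYOUT_MODEL_MAPPING[choice]
    if model_name is None:
        return {}  # sentinel: omit the argument entirely
    # The keyword name is assumed to match the PaddleOCR release in use.
    return {"layout_detection_model_name": model_name}
```

Returning an empty dictionary for the sentinel keeps the "default" choice distinct from a missing key, which is one way to avoid treating "default" as an unknown model.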
#### Scenario: Layout model is sent via API request
- **GIVEN** a frontend application with model selection UI
- **WHEN** the user starts task processing with a selected model
- **THEN** the frontend SHALL send the model choice in the request body:
```json
POST /api/v2/tasks/{task_id}/start
{
  "use_dual_track": true,
  "force_track": "ocr",
  "language": "ch",
  "pp_structure_params": {
    "layout_detection_threshold": 0.15,
    "layout_merge_bboxes_mode": "small",
    "text_det_thresh": 0.1
  }
  "layout_model": "chinese"
}
```
- **AND** the backend SHALL parse and apply these parameters
- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model

#### Scenario: Backward compatibility is maintained
- **GIVEN** existing API clients without PP-StructureV3 parameter support
- **WHEN** a task is started without `pp_structure_params`
- **THEN** the system SHALL use backend default settings
- **AND** processing SHALL work exactly as before
- **AND** no errors SHALL occur
#### Scenario: Default model when not specified
- **GIVEN** an API request without `layout_model` parameter
- **WHEN** the task is started
- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- **AND** processing SHALL work correctly without requiring model selection

#### Scenario: Invalid parameters are rejected
- **GIVEN** a request with invalid parameter values
- **WHEN** the user sends:
  - `layout_detection_threshold` = 1.5 (exceeds max 1.0)
  - `layout_merge_bboxes_mode` = "invalid" (not in allowed values)
#### Scenario: Invalid model name is rejected
- **GIVEN** a request with an invalid `layout_model` value
- **WHEN** the user sends `layout_model: "invalid_model"`
- **THEN** the API SHALL return 422 Validation Error
- **AND** provide clear error messages about invalid parameters
- **AND** provide a clear error message listing valid model options

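The default-model and invalid-model scenarios above can be illustrated with a small parser for the request body. This is a standard-library sketch under stated assumptions: the backend presumably relies on its web framework's request validation to produce the 422 response, and `parse_layout_model` and `LayoutModelError` are hypothetical names.

```python
from typing import Any, Mapping

VALID_LAYOUT_MODELS = ("chinese", "default", "cdla")
DEFAULT_LAYOUT_MODEL = "chinese"  # default named in the scenario above


class LayoutModelError(ValueError):
    """Raised for an unknown layout_model; a caller would map this to HTTP 422."""


def parse_layout_model(payload: Mapping[str, Any]) -> str:
    """Return the requested layout model, falling back to the default."""
    value = payload.get("layout_model", DEFAULT_LAYOUT_MODEL)
    if value not in VALID_LAYOUT_MODELS:
        options = ", ".join(VALID_LAYOUT_MODELS)
        raise LayoutModelError(f"Invalid layout_model {value!r}; valid options: {options}")
    return value
```

A request body without `layout_model` resolves to `"chinese"`, while `{"layout_model": "invalid_model"}` raises an error whose message lists the three valid options.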
#### Scenario: Custom parameters affect only current processing
- **GIVEN** multiple concurrent OCR processing tasks
- **WHEN** Task A uses custom parameters and Task B uses defaults
- **THEN** Task A SHALL process with its custom parameters
- **AND** Task B SHALL process with default parameters
- **AND** no parameter interference SHALL occur between tasks
### Requirement: Layout Model Selection UI
The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

### Requirement: PP-StructureV3 Parameter UI Controls
The frontend SHALL provide intuitive UI controls for adjusting PP-StructureV3 parameters with appropriate constraints and help text.
#### Scenario: Model options are displayed with descriptions
- **GIVEN** the model selection UI is displayed
- **WHEN** the user views the available options
- **THEN** the UI SHALL show the following options:
  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
  - "Standard Model" - for English academic papers, reports
  - "CDLA Model" - for specialized Chinese layout analysis
- **AND** each option SHALL have a brief description of its use case

#### Scenario: Slider controls for numeric parameters
- **GIVEN** the parameter adjustment UI is displayed
- **WHEN** the user adjusts a numeric parameter slider
- **THEN** the slider SHALL enforce min/max constraints:
  - Threshold parameters: 0.0 to 1.0
  - Ratio parameters: > 0 (typically 0.5 to 3.0)
- **AND** display current value in real-time
- **AND** show help text explaining the parameter effect
#### Scenario: Chinese model is selected by default
- **GIVEN** the user opens the task processing interface
- **WHEN** the model selection is displayed
- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
- **AND** the user MAY change the selection before starting processing

#### Scenario: Dropdown for merge mode selection
- **GIVEN** the layout merge mode parameter
- **WHEN** the user clicks the dropdown
- **THEN** the UI SHALL show exactly three options:
  - "small" (conservative merging)
  - "large" (aggressive merging)
  - "union" (middle ground)
- **AND** display description for each option

#### Scenario: Parameters shown only for OCR track
#### Scenario: Model selection is visible only for OCR track
- **GIVEN** a document processing interface
- **WHEN** the user selects processing track
- **THEN** PP-StructureV3 parameters SHALL be shown ONLY when OCR track is selected
- **AND** SHALL be hidden for Direct track
- **AND** SHALL be disabled for Auto track until track is determined
- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)