# ocr-processing Specification

## Purpose

TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.

## Requirements

### Requirement: OCR Track Gap Filling with Raw OCR Regions

The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.

#### Scenario: Gap filling activates when coverage is low

- **GIVEN** an OCR track processing task
- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- **THEN** the system SHALL activate gap filling
- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
- **AND** supplement these regions as TEXT elements in the output

#### Scenario: Coverage is determined by IoA (Intersection over Area)

- **GIVEN** a Raw OCR text region with a bounding box
- **WHEN** checking if the region is covered by PP-StructureV3
- **THEN** the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
- **AND** IoA SHALL be used instead of IoU because it correctly measures the "small box contained in a large box" relationship
- **AND** regions not meeting the IoA criterion SHALL be marked as uncovered

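The difference between IoA and IoU can be sketched as follows. This is a minimal illustration, not the project's implementation; the `(x0, y0, x1, y1)` box layout and the function names `ioa` and `iou` are assumptions made for the example.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def _area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def _intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes."""
    return _area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))


def ioa(ocr_box: Box, element_box: Box) -> float:
    """Intersection area divided by the area of the OCR box."""
    ocr_area = _area(ocr_box)
    return _intersection(ocr_box, element_box) / ocr_area if ocr_area > 0 else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the union of both boxes."""
    inter = _intersection(a, b)
    union = _area(a) + _area(b) - inter
    return inter / union if union > 0 else 0.0


# A small OCR box fully contained in a large layout element.
small = (10.0, 10.0, 20.0, 20.0)
large = (0.0, 0.0, 100.0, 100.0)
print(ioa(small, large))  # 1.0: the OCR box is fully covered
print(iou(small, large))  # 0.01: IoU would wrongly suggest almost no overlap
```

Because the denominator is only the OCR box area, a text line sitting inside a large table or figure element scores close to 1.0 regardless of how large the enclosing element is.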
#### Scenario: Element-type-specific IoA thresholds are applied

- **GIVEN** a Raw OCR region being evaluated for coverage
- **WHEN** comparing against PP-StructureV3 elements of different types
- **THEN** the system SHALL apply different IoA thresholds:
  - TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
  - TABLE: IoA > 0.1 (strict filtering to preserve table structure)
  - FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels)
- **AND** a region is considered covered if it meets the threshold for ANY overlapping element

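The per-type thresholds and the "ANY overlapping element" rule could be combined as in the sketch below. The threshold values come from the scenario above; everything else is an assumption for illustration: the `(type, box)` tuple shape, the fallback threshold for unlisted types, and measuring page coverage as the fraction of OCR regions that are covered (the spec does not say whether coverage is count-based or area-based).

```python
from typing import Dict, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)

# Threshold values taken from the scenario above.
IOA_THRESHOLDS: Dict[str, float] = {
    "TEXT": 0.6, "TITLE": 0.6, "HEADER": 0.6, "FOOTER": 0.6,
    "TABLE": 0.1,
    "FIGURE": 0.8, "IMAGE": 0.8,
}


def ioa(ocr_box: Box, element_box: Box) -> float:
    """Intersection area divided by the OCR box area."""
    w = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    h = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return max(0.0, w) * max(0.0, h) / area if area > 0 else 0.0


def is_covered(ocr_box: Box, elements: Iterable[Tuple[str, Box]],
               default_threshold: float = 0.6) -> bool:
    """True if the OCR box exceeds the threshold of ANY layout element.

    default_threshold for unlisted element types is an assumption.
    """
    return any(
        ioa(ocr_box, box) > IOA_THRESHOLDS.get(etype, default_threshold)
        for etype, box in elements
    )


def coverage_ratio(ocr_boxes: List[Box],
                   elements: List[Tuple[str, Box]]) -> float:
    """Fraction of OCR boxes that are covered (count-based, an assumption)."""
    if not ocr_boxes:
        return 1.0
    return sum(is_covered(b, elements) for b in ocr_boxes) / len(ocr_boxes)
```

With these values, an OCR box that overlaps a TABLE by only a quarter of its own area is already treated as covered, while the same overlap with a FIGURE is not, so text near a figure is more likely to be supplemented.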
#### Scenario: Only TEXT elements are supplemented

- **GIVEN** uncovered Raw OCR regions identified for supplementation
- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
- **AND** only supplement regions as TEXT type to preserve structural integrity

#### Scenario: Supplemented regions meet confidence threshold

- **GIVEN** Raw OCR regions to be supplemented
- **WHEN** a region has a confidence score below 0.3
- **THEN** the system SHALL skip that region
- **AND** only supplement regions with confidence >= 0.3

#### Scenario: Deduplication uses IoA instead of IoU

- **GIVEN** a Raw OCR region being considered for supplementation
- **WHEN** the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element
- **THEN** the system SHALL skip that region to prevent duplicate text
- **AND** the original PP-StructureV3 element SHALL be preserved

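The three filtering scenarios above (structural overlap, confidence, deduplication) could be applied in one pass over the candidate regions. The sketch below is illustrative only: the region dictionary keys (`text`, `confidence`, `bbox`) and the function name are hypothetical, and treating "overlap" with a structural element as any positive intersection is an assumption, since the spec gives no threshold for it.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def ioa(ocr_box: Box, other: Box) -> float:
    """Intersection area divided by the OCR box area."""
    w = min(ocr_box[2], other[2]) - max(ocr_box[0], other[0])
    h = min(ocr_box[3], other[3]) - max(ocr_box[1], other[1])
    area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    return max(0.0, w) * max(0.0, h) / area if area > 0 else 0.0


def select_supplements(
    candidates: List[Dict],
    text_boxes: Sequence[Box],
    structural_boxes: Sequence[Box] = (),
    min_confidence: float = 0.3,
    dedup_ioa: float = 0.5,
) -> List[Dict]:
    """Keep only uncovered regions that are safe to add as TEXT elements."""
    kept = []
    for region in candidates:
        box = region["bbox"]
        # Skip low-confidence OCR output (confidence below the threshold).
        if region["confidence"] < min_confidence:
            continue
        # Assumption: any positive overlap with a TABLE/IMAGE/FIGURE/...
        # element disqualifies the region.
        if any(ioa(box, s) > 0.0 for s in structural_boxes):
            continue
        # Skip near-duplicates of existing PP-StructureV3 TEXT elements.
        if any(ioa(box, t) > dedup_ioa for t in text_boxes):
            continue
        kept.append(region)
    return kept
```

The existing PP-StructureV3 elements are never modified here; the function only decides which Raw OCR regions are added alongside them.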
#### Scenario: Reading order is recalculated after gap filling

- **GIVEN** supplemented elements have been added to the page
- **WHEN** assembling the final element list
- **THEN** the system SHALL recalculate reading order for the entire page
- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- **AND** ensure logical document flow is maintained

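The sort described above is a plain two-key ordering. A minimal sketch, assuming elements are dictionaries with a `bbox` of `(x0, y0, x1, y1)` and that the result is stored in a hypothetical `reading_order` field:

```python
from typing import Dict, List


def assign_reading_order(elements: List[Dict]) -> List[Dict]:
    """Sort by y0 (top to bottom), then x0 (left to right), and number the result."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Return copies so the caller's element dictionaries are not mutated.
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]


page = [
    {"text": "right", "bbox": (50, 10, 90, 20)},
    {"text": "left", "bbox": (10, 10, 40, 20)},
    {"text": "top", "bbox": (0, 5, 90, 9)},
]
print([e["text"] for e in assign_reading_order(page)])  # ['top', 'left', 'right']
```

A strict sort on y0 is sensitive to small vertical jitter between boxes on the same text line; grouping boxes into rows before sorting by x0 is a common refinement, though the scenario above does not require it.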
#### Scenario: Coordinate alignment with ocr_dimensions

- **GIVEN** Raw OCR processing may involve image resizing
- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
- **AND** ensure both sources reference the same coordinate space
- **AND** prevent coverage misdetection due to scale differences

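One way to picture the normalization is a per-axis rescale from the image size that Raw OCR ran on to the size used by the layout elements. The spec does not define the shape of `ocr_dimensions`, so the `{"width": ..., "height": ...}` dictionary, the `target_size` parameter, and the function name below are all assumptions for illustration.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def to_layout_space(bbox: Box, ocr_dimensions: Dict[str, float],
                    target_size: Tuple[float, float]) -> Box:
    """Rescale an OCR bbox into the coordinate space of the layout elements.

    ocr_dimensions: size of the (possibly resized) image OCR ran on (assumed shape).
    target_size: (width, height) of the space the layout elements use.
    """
    scale_x = target_size[0] / ocr_dimensions["width"]
    scale_y = target_size[1] / ocr_dimensions["height"]
    x0, y0, x1, y1 = bbox
    return (x0 * scale_x, y0 * scale_y, x1 * scale_x, y1 * scale_y)


# OCR ran on a half-size image, so every coordinate doubles.
print(to_layout_space((100, 50, 200, 100),
                      {"width": 1000, "height": 500},
                      (2000, 1000)))  # (200.0, 100.0, 400.0, 200.0)
```

Without this step, an OCR box from a downscaled image would appear to lie in the wrong place relative to the layout elements and would be reported as uncovered.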
#### Scenario: Supplemented elements have complete metadata

- **GIVEN** a Raw OCR region being added as a supplemented element
- **WHEN** creating the DocumentElement
- **THEN** the element SHALL include page_number
- **AND** include the confidence score from Raw OCR
- **AND** include the original bbox coordinates
- **AND** optionally include a source indicator for debugging

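The metadata requirements above map naturally onto a small record type. The class below is a hypothetical stand-in for the project's `DocumentElement`, whose real fields are not shown in this spec; the field names, the region dictionary keys, and the `"gap_filling"` source label are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class SupplementedElement:
    """Hypothetical stand-in for DocumentElement, for illustration only."""
    text: str
    page_number: int
    confidence: float
    bbox: Tuple[float, float, float, float]
    element_type: str = "TEXT"      # supplemented regions are always TEXT
    source: Optional[str] = None    # optional debugging indicator


def from_ocr_region(region: Dict, page_number: int,
                    debug: bool = False) -> SupplementedElement:
    """Build a supplemented element from a Raw OCR region dictionary."""
    return SupplementedElement(
        text=region["text"],
        page_number=page_number,
        confidence=region["confidence"],
        bbox=tuple(region["bbox"]),  # original coordinates, unmodified
        source="gap_filling" if debug else None,
    )
```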
### Requirement: Gap Filling Track Isolation

The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.

#### Scenario: Gap filling only activates for OCR track

- **GIVEN** a document processing task
- **WHEN** the processing track is OCR
- **THEN** the system SHALL evaluate and apply gap filling as needed
- **AND** produce enhanced output with supplemented content

#### Scenario: Direct track is unaffected

- **GIVEN** a document processing task with Direct track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke any gap filling logic
- **AND** produce output identical to current Direct track behavior

#### Scenario: Hybrid track is unaffected

- **GIVEN** a document processing task with Hybrid track
- **WHEN** the task is processed
- **THEN** the system SHALL NOT invoke gap filling logic
- **AND** use the existing Hybrid track processing pipeline

### Requirement: Gap Filling Configuration

The system SHALL provide configurable parameters for gap filling behavior.

#### Scenario: Gap filling can be disabled via configuration

- **GIVEN** gap_filling_enabled is set to false in configuration
- **WHEN** OCR track processing runs
- **THEN** the system SHALL skip all gap filling logic
- **AND** output only PP-StructureV3 results as before

#### Scenario: Coverage threshold is configurable

- **GIVEN** gap_filling_coverage_threshold is set to 0.8
- **WHEN** PP-StructureV3 coverage is 75%
- **THEN** the system SHALL activate gap filling
- **AND** supplement uncovered regions

#### Scenario: IoA thresholds are configurable per element type

- **GIVEN** custom IoA thresholds configured:
  - gap_filling_ioa_threshold_text: 0.6
  - gap_filling_ioa_threshold_table: 0.1
  - gap_filling_ioa_threshold_figure: 0.8
  - gap_filling_dedup_ioa_threshold: 0.5
- **WHEN** evaluating coverage and deduplication
- **THEN** the system SHALL use the configured values
- **AND** apply them consistently throughout the gap filling process

#### Scenario: Confidence threshold is configurable

- **GIVEN** gap_filling_confidence_threshold is set to 0.5
- **WHEN** supplementing Raw OCR regions
- **THEN** the system SHALL only include regions with confidence >= 0.5
- **AND** filter out lower confidence regions

#### Scenario: Boundary shrinking reduces edge duplicates

- **GIVEN** gap_filling_shrink_pixels is set to 1
- **WHEN** evaluating coverage with IoA
- **THEN** the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side
- **AND** this reduces false "uncovered" detection at region boundaries

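The configuration scenarios above, together with the track isolation requirement, could be gathered into one settings object and an activation check. This is a sketch under assumptions: the setting names come from this spec, but the `GapFillingConfig` class, the helper functions, the lowercase track string `"ocr"`, and the default of 1 for `gap_filling_shrink_pixels` are illustrative rather than confirmed.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


@dataclass(frozen=True)
class GapFillingConfig:
    """Hypothetical container; field names follow the settings in this spec."""
    gap_filling_enabled: bool = True
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_ioa_threshold_text: float = 0.6
    gap_filling_ioa_threshold_table: float = 0.1
    gap_filling_ioa_threshold_figure: float = 0.8
    gap_filling_dedup_ioa_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3
    gap_filling_shrink_pixels: int = 1  # default value is an assumption


def should_fill_gaps(track: str, coverage: float, cfg: GapFillingConfig) -> bool:
    """Gap filling runs only on the OCR track, when enabled and coverage is low."""
    return (
        track == "ocr"
        and cfg.gap_filling_enabled
        and coverage < cfg.gap_filling_coverage_threshold
    )


def shrink_box(bbox: Box, pixels: float) -> Box:
    """Shrink a box inward on every side; leave it unchanged if it would collapse."""
    x0, y0, x1, y1 = bbox
    nx0, ny0, nx1, ny1 = x0 + pixels, y0 + pixels, x1 - pixels, y1 - pixels
    if nx0 >= nx1 or ny0 >= ny1:
        return bbox
    return (nx0, ny0, nx1, ny1)


cfg = GapFillingConfig(gap_filling_coverage_threshold=0.8)
print(should_fill_gaps("ocr", 0.75, cfg))     # True: 75% is below the 0.8 threshold
print(should_fill_gaps("direct", 0.75, cfg))  # False: other tracks are unaffected
```

Shrinking the OCR box before computing IoA removes the thin border strip that tends to stick out past a layout element's edge, so a box that is almost entirely inside an element is less likely to fall just short of the threshold.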
### Requirement: Layout Model Selection

The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.

#### Scenario: User selects Chinese document model

- **GIVEN** a user is processing Chinese business documents (forms, contracts, invoices)
- **WHEN** the user selects "Chinese Document Model" (PP-DocLayout-S)
- **THEN** the OCR engine SHALL use the PP-DocLayout-S layout detection model
- **AND** the model SHALL be optimized for 23 Chinese document element types
- **AND** table and form detection accuracy SHALL be improved over the default model

#### Scenario: User selects standard model for English documents

- **GIVEN** a user is processing English academic papers or reports
- **WHEN** the user selects "Standard Model" (PubLayNet-based)
- **THEN** the OCR engine SHALL use the default PubLayNet-based layout detection model
- **AND** the model SHALL be optimized for English document layouts

#### Scenario: User selects CDLA model for specialized Chinese layout

- **GIVEN** a user is processing Chinese documents with complex layouts
- **WHEN** the user selects "CDLA Model"
- **THEN** the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- **AND** the model SHALL provide specialized Chinese document layout analysis

#### Scenario: Layout model is sent via API request

- **GIVEN** a frontend application with model selection UI
- **WHEN** the user starts task processing with a selected model
- **THEN** the frontend SHALL send the model choice in the request body:
  ```json
  POST /api/v2/tasks/{task_id}/start
  {
    "use_dual_track": true,
    "force_track": "ocr",
    "language": "ch",
    "layout_model": "chinese"
  }
  ```
- **AND** the backend SHALL configure PP-StructureV3 with the corresponding model

#### Scenario: Default model when not specified

- **GIVEN** an API request without a `layout_model` parameter
- **WHEN** the task is started
- **THEN** the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- **AND** processing SHALL work correctly without requiring model selection

#### Scenario: Invalid model name is rejected

- **GIVEN** a request with an invalid `layout_model` value
- **WHEN** the user sends `layout_model: "invalid_model"`
- **THEN** the API SHALL return a 422 Validation Error
- **AND** provide a clear error message listing valid model options

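The default and validation rules above can be sketched as a lookup with an explicit error. Only the `"chinese"` request value appears in this spec; the `"default"` and `"cdla"` keys, the `InvalidLayoutModel` exception, and the use of `None` to mean "let the engine use its built-in PubLayNet-based model" are assumptions for illustration. An API layer would be expected to translate the error into the 422 response.

```python
from typing import Dict, Optional

# Request value -> layout detection model name.
# "chinese" is from the spec; "default" and "cdla" are hypothetical keys.
LAYOUT_MODELS: Dict[str, Optional[str]] = {
    "chinese": "PP-DocLayout-S",
    "default": None,  # assumption: None means the engine's built-in default model
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}


class InvalidLayoutModel(ValueError):
    """Raised for unknown layout_model values (maps to a 422 at the API layer)."""


def resolve_layout_model(layout_model: Optional[str] = None) -> Optional[str]:
    """Return the model name for a request value, defaulting to "chinese"."""
    key = "chinese" if layout_model is None else layout_model
    if key not in LAYOUT_MODELS:
        valid = ", ".join(sorted(LAYOUT_MODELS))
        raise InvalidLayoutModel(
            f"Invalid layout_model {key!r}; valid options: {valid}"
        )
    return LAYOUT_MODELS[key]


print(resolve_layout_model())        # PP-DocLayout-S
print(resolve_layout_model("cdla"))  # picodet_lcnet_x1_0_fgd_layout_cdla
```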
### Requirement: Layout Model Selection UI

The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.

#### Scenario: Model options are displayed with descriptions

- **GIVEN** the model selection UI is displayed
- **WHEN** the user views the available options
- **THEN** the UI SHALL show the following options:
  - "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
  - "Standard Model" - for English academic papers, reports
  - "CDLA Model" - for specialized Chinese layout analysis
- **AND** each option SHALL have a brief description of its use case

#### Scenario: Chinese model is selected by default

- **GIVEN** the user opens the task processing interface
- **WHEN** the model selection is displayed
- **THEN** "Chinese Document Model" SHALL be pre-selected as the default
- **AND** the user MAY change the selection before starting processing

#### Scenario: Model selection is visible only for OCR track

- **GIVEN** a document processing interface
- **WHEN** the user selects a processing track
- **THEN** layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- **AND** SHALL be hidden for Direct track (which does not use PP-StructureV3)

### Requirement: Model Cache Cleanup

The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.

#### Scenario: User wants to free disk space after model upgrade

- **WHEN** the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
- **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/`
- **AND** list which model directories can be safely removed

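Before deleting anything, it helps to see which model directories exist and how much space each one takes. The helper below is a hypothetical, read-only sketch using only the standard library; it assumes each cached model is a subdirectory of the cache root and does not decide which models are safe to remove.

```python
from pathlib import Path
from typing import List, Tuple


def list_cached_models(cache_root: Path) -> List[Tuple[str, int]]:
    """Return (directory name, total size in bytes) for each cached model."""
    if not cache_root.is_dir():
        return []
    report = []
    for model_dir in sorted(p for p in cache_root.iterdir() if p.is_dir()):
        # Sum the size of every regular file below the model directory.
        size = sum(f.stat().st_size for f in model_dir.rglob("*") if f.is_file())
        report.append((model_dir.name, size))
    return report
```

Calling `list_cached_models(Path.home() / ".paddlex" / "official_models")` would list the cache location named above; removal itself is left as a deliberate manual step.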
### Requirement: Cell Over-Detection Filtering

The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.

#### Scenario: Cell density exceeds threshold

- **GIVEN** a table detected by PP-StructureV3 with cell_boxes
- **WHEN** cell density exceeds 3.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element

#### Scenario: Average cell area below threshold

- **GIVEN** a table detected by PP-StructureV3
- **WHEN** average cell area is less than 3,000 px²
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element

#### Scenario: Cell height too small

- **GIVEN** a table with height H and N cells
- **WHEN** (H / N) is less than 10 pixels
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element

#### Scenario: Valid tables are preserved

- **GIVEN** a table with normal metrics (density <= 3.0 cells per 10,000 px², average cell area >= 3,000 px², H / N >= 10 px)
- **WHEN** validation is applied
- **THEN** the table SHALL be preserved unchanged
- **AND** all cell_boxes SHALL be retained

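The three heuristics above can be evaluated together, as in the sketch below. The thresholds and the H / N formula are from this spec; computing density against the table's bounding-box area, taking the average over individual cell box areas, and the function name are assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1)


def is_over_detected(
    table_bbox: Box,
    cell_boxes: Sequence[Box],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px, compared against H / N
) -> bool:
    """Return True if any of the three over-detection heuristics fires."""
    n = len(cell_boxes)
    x0, y0, x1, y1 = table_bbox
    table_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if n == 0 or table_area <= 0:
        return False  # assumption: nothing to validate, keep the table

    density = n / table_area * 10_000
    avg_cell_area = sum(
        max(0.0, cx1 - cx0) * max(0.0, cy1 - cy0)
        for cx0, cy0, cx1, cy1 in cell_boxes
    ) / n
    height_per_cell = (y1 - y0) / n  # H / N as defined in the scenario above

    return (
        density > max_cell_density
        or avg_cell_area < min_avg_cell_area
        or height_per_cell < min_cell_height
    )


# A 400x300 px table with four 200x150 px cells passes all three checks.
cells = [(0, 0, 200, 150), (200, 0, 400, 150), (0, 150, 200, 300), (200, 150, 400, 300)]
print(is_over_detected((0, 0, 400, 300), cells))  # False
```

Note that H / N divides the table height by the total cell count rather than the row count, so wide tables with many columns reach the 10 px limit sooner than the name "cell height" suggests.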
### Requirement: Table-to-Text Reclassification

The system SHALL convert over-detected tables to TEXT elements while preserving content.

#### Scenario: Table content is preserved

- **GIVEN** a table flagged for reclassification
- **WHEN** converting to a TEXT element
- **THEN** the system SHALL extract text content from table HTML
- **AND** preserve the original bounding box
- **AND** set the element type to TEXT

#### Scenario: Reading order is recalculated

- **GIVEN** tables have been reclassified as TEXT
- **WHEN** assembling the final page structure
- **THEN** the system SHALL recalculate reading order
- **AND** sort elements by y0 then x0 coordinates

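Extracting text from the table HTML can be done with the standard library alone. The sketch below is one possible approach, not the project's implementation: the element dictionary keys (`type`, `bbox`, `html`, `content`) are hypothetical, and joining cells with spaces and rows with newlines is an assumed output format.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _TableTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:  # tolerate cells that appear outside any <tr>
                self.rows.append([])
            # Collapse internal whitespace inside the cell.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def table_html_to_text(html: str) -> str:
    """One line per table row, cells separated by spaces, empty cells dropped."""
    parser = _TableTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(c for c in row if c) for row in parser.rows)
    return "\n".join(line for line in lines if line)


def reclassify_as_text(table_element: Dict) -> Dict:
    """Build a TEXT element that keeps the table's original bounding box."""
    return {
        "type": "TEXT",
        "bbox": table_element["bbox"],
        "content": table_html_to_text(table_element["html"]),
    }
```

For instance, `<table><tr><td>A</td><td>B</td></tr><tr><td>C</td><td></td></tr></table>` becomes two lines, `A B` and `C`.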
### Requirement: Validation Configuration

The system SHALL provide configurable thresholds for cell validation.

#### Scenario: Default thresholds are applied

- **GIVEN** no custom configuration is provided
- **WHEN** validating tables
- **THEN** the system SHALL use default thresholds:
  - max_cell_density: 3.0 cells/10000px²
  - min_avg_cell_area: 3000 px²
  - min_cell_height: 10 px

#### Scenario: Custom thresholds can be configured

- **GIVEN** custom validation thresholds in configuration
- **WHEN** validating tables
- **THEN** the system SHALL use the custom values
- **AND** apply them consistently to all pages

### Requirement: Use PP-StructureV3 Internal OCR Results

The system SHALL preferentially use PP-StructureV3's internal OCR results (`overall_ocr_res`) instead of running a separate Raw OCR inference.

#### Scenario: Extract overall_ocr_res from PP-StructureV3

- **GIVEN** PP-StructureV3 processing completes
- **WHEN** the result contains `json['res']['overall_ocr_res']`
- **THEN** the system SHALL extract OCR regions from:
  - `dt_polys`: detection box polygons
  - `rec_texts`: recognized text strings
  - `rec_scores`: confidence scores
- **AND** convert these to the standard TextRegion format for gap filling

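The extraction step above amounts to zipping three parallel lists. In the sketch below, the `res`, `overall_ocr_res`, `dt_polys`, `rec_texts`, and `rec_scores` keys are taken from this spec; the output dictionary is a hypothetical stand-in for the project's TextRegion format, and reducing each polygon to an axis-aligned bounding box is an assumption.

```python
from typing import Dict, List, Optional


def extract_text_regions(result_json: Dict) -> Optional[List[Dict]]:
    """Convert overall_ocr_res into region dictionaries.

    Returns None when overall_ocr_res is missing, so the caller can fall
    back to a separate Raw OCR pass.
    """
    ocr = result_json.get("res", {}).get("overall_ocr_res")
    if not ocr:
        return None

    regions = []
    for poly, text, score in zip(
        ocr.get("dt_polys", []), ocr.get("rec_texts", []), ocr.get("rec_scores", [])
    ):
        xs = [float(p[0]) for p in poly]
        ys = [float(p[1]) for p in poly]
        regions.append({
            "text": text,
            "confidence": float(score),
            # Assumption: an axis-aligned box is enough for IoA checks.
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            "polygon": list(zip(xs, ys)),
        })
    return regions


sample = {"res": {"overall_ocr_res": {
    "dt_polys": [[[10, 20], [110, 22], [108, 50], [9, 48]]],
    "rec_texts": ["Invoice"],
    "rec_scores": [0.98],
}}}
print(extract_text_regions(sample)[0]["bbox"])  # (9.0, 20.0, 110.0, 50.0)
```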
#### Scenario: Skip separate Raw OCR when overall_ocr_res is available

- **GIVEN** gap_filling_use_overall_ocr is true (default)
- **WHEN** PP-StructureV3 result contains overall_ocr_res
- **THEN** the system SHALL NOT execute separate PaddleOCR inference
- **AND** use the extracted overall_ocr_res as the OCR source
- **AND** this reduces total inference time by approximately 50%

#### Scenario: Fallback to separate Raw OCR when needed

- **GIVEN** gap_filling_use_overall_ocr is false OR overall_ocr_res is missing
- **WHEN** gap filling is activated
- **THEN** the system SHALL execute separate PaddleOCR inference as before
- **AND** use the separate OCR results for gap filling
- **AND** this maintains backward compatibility

#### Scenario: Coordinate consistency is guaranteed

- **GIVEN** overall_ocr_res is extracted from PP-StructureV3
- **WHEN** comparing with PP-StructureV3 layout elements
- **THEN** both SHALL use the same coordinate system
- **AND** no additional coordinate alignment is needed
- **AND** this prevents scale mismatch issues