ocr-processing Specification
Purpose
TBD - created by archiving change frontend-adjustable-ppstructure-params. Update Purpose after archive.
Requirements
Requirement: OCR Track Gap Filling with Raw OCR Regions
The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
Scenario: Gap filling activates when coverage is low
- GIVEN an OCR track processing task
- WHEN PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
- THEN the system SHALL activate gap filling
- AND identify Raw OCR regions not covered by any PP-StructureV3 element
- AND supplement these regions as TEXT elements in the output
Scenario: Coverage is determined by center-point and IoU
- GIVEN a Raw OCR text region with bounding box
- WHEN checking if the region is covered by PP-StructureV3
- THEN the region SHALL be considered covered if its center point falls inside any PP-StructureV3 element bbox
- OR if its IoU with any PP-StructureV3 element exceeds the 0.15 threshold
- AND regions not meeting either criterion SHALL be marked as uncovered
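The coverage rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes bounding boxes are `(x0, y0, x1, y1)` tuples in a shared coordinate space, and it assumes the 70% activation figure is measured as the fraction of Raw OCR regions that are covered (the specification does not say whether coverage is counted by region or by area). All function names are hypothetical.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_inside(region: BBox, element: BBox) -> bool:
    """True if the center point of `region` lies inside `element`."""
    cx = (region[0] + region[2]) / 2
    cy = (region[1] + region[3]) / 2
    return element[0] <= cx <= element[2] and element[1] <= cy <= element[3]


def is_covered(region: BBox, elements: List[BBox], iou_threshold: float = 0.15) -> bool:
    """A region is covered if either criterion holds for any element."""
    return any(
        center_inside(region, e) or iou(region, e) > iou_threshold
        for e in elements
    )


def coverage_ratio(regions: List[BBox], elements: List[BBox],
                   iou_threshold: float = 0.15) -> float:
    """Fraction of Raw OCR regions covered (count-based, an assumption)."""
    if not regions:
        return 1.0  # nothing to cover, so nothing is missing
    covered = sum(is_covered(r, elements, iou_threshold) for r in regions)
    return covered / len(regions)
```

Under these assumptions, gap filling would activate when `coverage_ratio(...)` returns a value below 0.7.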
Scenario: Only TEXT elements are supplemented
- GIVEN uncovered Raw OCR regions identified for supplementation
- WHEN PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
- THEN the system SHALL NOT supplement regions that overlap with these structural elements
- AND only supplement regions as TEXT type to preserve structural integrity
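A possible shape for the structural-element exclusion is shown below. The specification does not define how much overlap is needed, so this sketch assumes any positive-area intersection counts; the dictionary keys `type` and `bbox` are hypothetical stand-ins for the real element model.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout

# Element types listed in the scenario above.
STRUCTURAL_TYPES = {"TABLE", "IMAGE", "FIGURE", "FLOWCHART", "HEADER", "FOOTER"}


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share any positive-area intersection."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def filter_structural_overlaps(regions: List[Dict], elements: List[Dict]) -> List[Dict]:
    """Drop uncovered regions that intersect a structural element."""
    structural = [
        e["bbox"] for e in elements if e["type"].upper() in STRUCTURAL_TYPES
    ]
    return [
        r for r in regions
        if not any(overlaps(r["bbox"], s) for s in structural)
    ]
```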
Scenario: Supplemented regions meet confidence threshold
- GIVEN Raw OCR regions to be supplemented
- WHEN a region has confidence score below 0.3
- THEN the system SHALL skip that region
- AND only supplement regions with confidence >= 0.3
Scenario: Deduplication prevents repeated text
- GIVEN a Raw OCR region being considered for supplementation
- WHEN the region has IoU > 0.5 with any existing PP-StructureV3 TEXT element
- THEN the system SHALL skip that region to prevent duplicate text
- AND the original PP-StructureV3 element SHALL be preserved
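The confidence and deduplication scenarios can be combined into a single filtering pass. The sketch below assumes regions are dictionaries with hypothetical `bbox` and `confidence` keys and that existing TEXT elements are given as plain `(x0, y0, x1, y1)` boxes; only the thresholds (0.3 and 0.5) come from the specification.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_supplements(regions: List[Dict], text_elements: List[BBox],
                       min_confidence: float = 0.3,
                       dedup_iou: float = 0.5) -> List[Dict]:
    """Keep regions that are confident enough and not duplicates."""
    kept = []
    for region in regions:
        if region["confidence"] < min_confidence:
            continue  # below the confidence threshold
        if any(iou(region["bbox"], t) > dedup_iou for t in text_elements):
            continue  # near-duplicate of an existing TEXT element
        kept.append(region)
    return kept
```

Existing PP-StructureV3 elements are never modified here; duplicates are dropped only from the Raw OCR side.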
Scenario: Reading order is recalculated after gap filling
- GIVEN supplemented elements have been added to the page
- WHEN assembling the final element list
- THEN the system SHALL recalculate reading order for the entire page
- AND sort elements by y0 coordinate (top to bottom) then x0 (left to right)
- AND ensure logical document flow is maintained
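The sort described above is a two-key ordering. A minimal sketch, assuming elements are dictionaries with a `bbox` of `(x0, y0, x1, y1)` and that the resulting position is stored in a hypothetical `reading_order` field:

```python
from typing import Dict, List


def recalc_reading_order(elements: List[Dict]) -> List[Dict]:
    """Sort by y0 (top to bottom), then x0 (left to right), and renumber."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Return copies so the caller's elements are left untouched.
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]
```

A strict y0 sort treats elements whose tops differ by even one pixel as separate rows. Whether the real pipeline groups near-equal y0 values into lines is not stated in this specification.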
Scenario: Coordinate alignment with ocr_dimensions
- GIVEN Raw OCR processing may involve image resizing
- WHEN comparing Raw OCR bbox with PP-StructureV3 bbox
- THEN the system SHALL use ocr_dimensions to normalize coordinates
- AND ensure both sources reference the same coordinate space
- AND prevent coverage misdetection due to scale differences
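The specification does not describe the structure of ocr_dimensions, so the following is only a sketch of the general idea: if Raw OCR ran on an image of one size and PP-StructureV3 boxes refer to another, scale one set of boxes before comparing. Here both sizes are assumed to be available as `(width, height)` tuples, and the function name is hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout
Size = Tuple[float, float]                # (width, height), assumed layout


def scale_bbox(bbox: BBox, src_size: Size, dst_size: Size) -> BBox:
    """Map a box from the source image space into the destination space."""
    if src_size[0] <= 0 or src_size[1] <= 0:
        raise ValueError("source dimensions must be positive")
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    x0, y0, x1, y1 = bbox
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

For example, a box found on a 1000 x 2000 OCR image maps to half-size coordinates when the layout boxes refer to a 500 x 1000 image: `scale_bbox((10, 20, 30, 40), (1000, 2000), (500, 1000))`.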
Scenario: Supplemented elements have complete metadata
- GIVEN a Raw OCR region being added as supplemented element
- WHEN creating the DocumentElement
- THEN the element SHALL include page_number
- AND include confidence score from Raw OCR
- AND include original bbox coordinates
- AND optionally include source indicator for debugging
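The metadata requirements could be captured in a small record type. The dataclass below is a hypothetical stand-in for DocumentElement, whose real definition is not given here; the field names and the `raw_ocr_gap_fill` source label are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SupplementedElement:
    """Hypothetical stand-in for DocumentElement with the required metadata."""
    text: str
    bbox: Tuple[float, float, float, float]  # original Raw OCR coordinates
    page_number: int
    confidence: float                        # confidence score from Raw OCR
    element_type: str = "TEXT"
    source: Optional[str] = None             # optional debugging indicator


def from_raw_ocr(region: Dict, page_number: int,
                 debug: bool = False) -> SupplementedElement:
    """Build a supplemented element from a Raw OCR region dictionary."""
    return SupplementedElement(
        text=region["text"],
        bbox=tuple(region["bbox"]),
        page_number=page_number,
        confidence=region["confidence"],
        source="raw_ocr_gap_fill" if debug else None,
    )
```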
Requirement: Gap Filling Track Isolation
The gap filling feature SHALL only apply to OCR track processing and SHALL NOT affect Direct or Hybrid track outputs.
Scenario: Gap filling only activates for OCR track
- GIVEN a document processing task
- WHEN the processing track is OCR
- THEN the system SHALL evaluate and apply gap filling as needed
- AND produce enhanced output with supplemented content
Scenario: Direct track is unaffected
- GIVEN a document processing task with Direct track
- WHEN the task is processed
- THEN the system SHALL NOT invoke any gap filling logic
- AND produce output identical to current Direct track behavior
Scenario: Hybrid track is unaffected
- GIVEN a document processing task with Hybrid track
- WHEN the task is processed
- THEN the system SHALL NOT invoke gap filling logic
- AND use existing Hybrid track processing pipeline
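Track isolation amounts to a single guard before any gap filling code runs. In this sketch the `"ocr"` value matches the `force_track` value used elsewhere in this specification; the `"direct"` and `"hybrid"` string values and the names `Track` and `gap_filling_applies` are assumptions.

```python
from enum import Enum


class Track(str, Enum):
    OCR = "ocr"
    DIRECT = "direct"  # assumed identifier
    HYBRID = "hybrid"  # assumed identifier


def gap_filling_applies(track: Track, enabled: bool = True) -> bool:
    """Gap filling is considered only on the OCR track, and only if enabled."""
    return enabled and track is Track.OCR
```

Placing the check at the entry point keeps Direct and Hybrid outputs untouched, because no gap filling function is ever called for them.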
Requirement: Gap Filling Configuration
The system SHALL provide configurable parameters for gap filling behavior.
Scenario: Gap filling can be disabled via configuration
- GIVEN gap_filling_enabled is set to false in configuration
- WHEN OCR track processing runs
- THEN the system SHALL skip all gap filling logic
- AND output only PP-StructureV3 results as before
Scenario: Coverage threshold is configurable
- GIVEN gap_filling_coverage_threshold is set to 0.8
- WHEN PP-StructureV3 coverage is 75%
- THEN the system SHALL activate gap filling
- AND supplement uncovered regions
Scenario: IoU thresholds are configurable
- GIVEN custom IoU thresholds configured:
- gap_filling_iou_threshold: 0.2
- gap_filling_dedup_iou_threshold: 0.6
- WHEN evaluating coverage and deduplication
- THEN the system SHALL use the configured values
- AND apply them consistently throughout the gap filling process
Scenario: Confidence threshold is configurable
- GIVEN gap_filling_confidence_threshold is set to 0.5
- WHEN supplementing Raw OCR regions
- THEN the system SHALL only include regions with confidence >= 0.5
- AND filter out lower confidence regions
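The configuration parameters named in these scenarios could be grouped as below. The parameter names come from this specification, and the numeric defaults mirror the values in the gap filling requirement (70% coverage, 0.15 and 0.5 IoU, 0.3 confidence). The `True` default for gap_filling_enabled, the class name, and the `should_activate` helper are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GapFillingConfig:
    gap_filling_enabled: bool = True  # default value is an assumption
    gap_filling_coverage_threshold: float = 0.7
    gap_filling_iou_threshold: float = 0.15
    gap_filling_dedup_iou_threshold: float = 0.5
    gap_filling_confidence_threshold: float = 0.3

    def should_activate(self, coverage: float) -> bool:
        """Activate when enabled and coverage is below the threshold."""
        return (
            self.gap_filling_enabled
            and coverage < self.gap_filling_coverage_threshold
        )
```

With `gap_filling_coverage_threshold=0.8`, a coverage of 0.75 activates gap filling, while the default threshold of 0.7 does not.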
Requirement: Layout Model Selection
The system SHALL allow users to select a layout detection model optimized for their document type, providing a simple choice between pre-configured models instead of manual parameter tuning.
Scenario: User selects Chinese document model
- GIVEN a user is processing Chinese business documents (forms, contracts, invoices)
- WHEN the user selects "Chinese Document Model" (PP-DocLayout-S)
- THEN the OCR engine SHALL use the PP-DocLayout-S layout detection model
- AND the model SHALL be optimized for 23 Chinese document element types
- AND table and form detection accuracy SHALL be improved over the default model
Scenario: User selects standard model for English documents
- GIVEN a user is processing English academic papers or reports
- WHEN the user selects "Standard Model" (PubLayNet-based)
- THEN the OCR engine SHALL use the default PubLayNet-based layout detection model
- AND the model SHALL be optimized for English document layouts
Scenario: User selects CDLA model for specialized Chinese layout
- GIVEN a user is processing Chinese documents with complex layouts
- WHEN the user selects "CDLA Model"
- THEN the OCR engine SHALL use the picodet_lcnet_x1_0_fgd_layout_cdla model
- AND the model SHALL provide specialized Chinese document layout analysis
Scenario: Layout model is sent via API request
- GIVEN a frontend application with model selection UI
- WHEN the user starts task processing with a selected model
- THEN the frontend SHALL send the model choice in the request body:
POST /api/v2/tasks/{task_id}/start
{
  "use_dual_track": true,
  "force_track": "ocr",
  "language": "ch",
  "layout_model": "chinese"
}
- AND the backend SHALL configure PP-StructureV3 with the corresponding model
Scenario: Default model when not specified
- GIVEN an API request without layout_model parameter
- WHEN the task is started
- THEN the system SHALL use "chinese" (PP-DocLayout-S) as the default model
- AND processing SHALL work correctly without requiring model selection
Scenario: Invalid model name is rejected
- GIVEN a request with an invalid layout_model value
- WHEN the user sends layout_model: "invalid_model"
- THEN the API SHALL return 422 Validation Error
- AND provide a clear error message listing valid model options
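Defaulting and validation of layout_model can be sketched with a plain lookup table. Only the `"chinese"` key and the two model names are taken from this specification. The `"default"` and `"cdla"` keys are assumed API values, and `None` stands in for the Standard Model because its concrete model name is not given here. Translating the raised error into an HTTP 422 response is left to the web framework.

```python
from typing import Dict, Optional

# Keys other than "chinese" are assumed API values.
LAYOUT_MODELS: Dict[str, Optional[str]] = {
    "chinese": "PP-DocLayout-S",
    "default": None,  # Standard Model; concrete name not specified here
    "cdla": "picodet_lcnet_x1_0_fgd_layout_cdla",
}
DEFAULT_LAYOUT_MODEL = "chinese"


def resolve_layout_model(requested: Optional[str] = None) -> Optional[str]:
    """Return the model name for a request value, applying the default."""
    key = DEFAULT_LAYOUT_MODEL if requested is None else requested
    if key not in LAYOUT_MODELS:
        valid = ", ".join(sorted(LAYOUT_MODELS))
        raise ValueError(f"Invalid layout_model {key!r}; valid options: {valid}")
    return LAYOUT_MODELS[key]
```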
Requirement: Layout Model Selection UI
The frontend SHALL provide a simple, user-friendly interface for selecting layout detection models with clear descriptions of each option.
Scenario: Model options are displayed with descriptions
- GIVEN the model selection UI is displayed
- WHEN the user views the available options
- THEN the UI SHALL show the following options:
- "Chinese Document Model (Recommended)" - for Chinese forms, contracts, invoices
- "Standard Model" - for English academic papers, reports
- "CDLA Model" - for specialized Chinese layout analysis
- AND each option SHALL have a brief description of its use case
Scenario: Chinese model is selected by default
- GIVEN the user opens the task processing interface
- WHEN the model selection is displayed
- THEN "Chinese Document Model" SHALL be pre-selected as the default
- AND the user MAY change the selection before starting processing
Scenario: Model selection is visible only for OCR track
- GIVEN a document processing interface
- WHEN the user selects processing track
- THEN layout model selection SHALL be shown ONLY when OCR track is selected or auto-detected
- AND SHALL be hidden for Direct track (which does not use PP-StructureV3)
Requirement: Model Cache Cleanup
The system SHALL provide documentation for cleaning up unused model caches to optimize storage space.
Scenario: User wants to free disk space after model upgrade
- WHEN the user has upgraded from older models (PP-DocLayout-S, SLANet) to newer models
- THEN the documentation SHALL explain how to delete unused cached models from ~/.paddlex/official_models/
- AND list which model directories can be safely removed
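As an aid for such documentation, a short script can report what is currently cached and how much space each model uses. This sketch only lists directories and deletes nothing, since which directories are safe to remove depends on the models in use. The function name is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def list_cached_models(cache_dir: Path) -> List[Tuple[str, int]]:
    """Return (directory name, total size in bytes) for each cached model."""
    if not cache_dir.is_dir():
        return []
    report = []
    for child in sorted(cache_dir.iterdir()):
        if child.is_dir():
            size = sum(f.stat().st_size for f in child.rglob("*") if f.is_file())
            report.append((child.name, size))
    return report


if __name__ == "__main__":
    cache = Path.home() / ".paddlex" / "official_models"
    for name, size in list_cached_models(cache):
        print(f"{size / 1_048_576:8.1f} MiB  {name}")
```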
Requirement: Cell Over-Detection Filtering
The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.
Scenario: Cell density exceeds threshold
- GIVEN a table detected by PP-StructureV3 with cell_boxes
- WHEN cell density exceeds 3.0 cells per 10,000 px²
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
Scenario: Average cell area below threshold
- GIVEN a table detected by PP-StructureV3
- WHEN average cell area is less than 3,000 px²
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
Scenario: Cell height too small
- GIVEN a table with height H and N cells
- WHEN (H / N) is less than 10 pixels
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
Scenario: Valid tables are preserved
- GIVEN a table with normal metrics (density ≤ 3.0 cells per 10,000 px², average cell area ≥ 3,000 px², H / N ≥ 10 px)
- WHEN validation is applied
- THEN the table SHALL be preserved unchanged
- AND all cell_boxes SHALL be retained
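The three heuristics can be evaluated together, with a table flagged as soon as any one of them fails. This is a sketch under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples in pixels, density is the cell count divided by the table's bounding-box area, and a table with no cells is left alone. The threshold names follow the validation configuration in this specification; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels, assumed


@dataclass(frozen=True)
class CellValidationThresholds:
    max_cell_density: float = 3.0      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0  # px²
    min_cell_height: float = 10.0      # px, compared against H / N


def is_over_detected(table_bbox: BBox, cell_boxes: List[BBox],
                     t: CellValidationThresholds = CellValidationThresholds()) -> bool:
    """True if any of the three metrics marks the table as over-detected."""
    n = len(cell_boxes)
    height = table_bbox[3] - table_bbox[1]
    table_area = (table_bbox[2] - table_bbox[0]) * height
    if n == 0 or table_area <= 0:
        return False  # nothing to validate (an assumption, not in the spec)
    density = n / table_area * 10_000
    avg_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cell_boxes) / n
    return (
        density > t.max_cell_density
        or avg_area < t.min_avg_cell_area
        or height / n < t.min_cell_height
    )
```

Custom thresholds can be supplied by passing a different `CellValidationThresholds` instance, which keeps the same values in force for every page.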
Requirement: Table-to-Text Reclassification
The system SHALL convert over-detected tables to TEXT elements while preserving content.
Scenario: Table content is preserved
- GIVEN a table flagged for reclassification
- WHEN converting to TEXT element
- THEN the system SHALL extract text content from table HTML
- AND preserve the original bounding box
- AND set element type to TEXT
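Reclassification needs the table's text without its markup. One way to do this with only the Python standard library is shown below. It assumes the table is available as an HTML string under a hypothetical `html` key, joins cells with spaces and rows with newlines, and stores the result under a hypothetical `text` key; the real pipeline may format the extracted text differently.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _TableTextParser(HTMLParser):
    """Collects the text of each td/th cell, grouped by tr row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            if not self.rows:
                self.rows.append([])
            self.rows[-1].append(text)
            self._cell = None


def table_html_to_text(html: str) -> str:
    parser = _TableTextParser()
    parser.feed(html)
    parser.close()
    lines = [" ".join(c for c in row if c) for row in parser.rows]
    return "\n".join(line for line in lines if line)


def reclassify_table(element: Dict) -> Dict:
    """Return a TEXT copy of a table element, keeping its original bbox."""
    return {**element, "type": "TEXT", "text": table_html_to_text(element["html"])}
```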
Scenario: Reading order is recalculated
- GIVEN tables have been reclassified as TEXT
- WHEN assembling the final page structure
- THEN the system SHALL recalculate reading order
- AND sort elements by y0 then x0 coordinates
Requirement: Validation Configuration
The system SHALL provide configurable thresholds for cell validation.
Scenario: Default thresholds are applied
- GIVEN no custom configuration is provided
- WHEN validating tables
- THEN the system SHALL use default thresholds:
- max_cell_density: 3.0 cells/10000px²
- min_avg_cell_area: 3000 px²
- min_cell_height: 10 px
Scenario: Custom thresholds can be configured
- GIVEN custom validation thresholds in configuration
- WHEN validating tables
- THEN the system SHALL use the custom values
- AND apply them consistently to all pages