chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/improve-ocr-track-algorithm/specs/ocr-processing/spec.md
+++ b/openspec/changes/improve-ocr-track-algorithm/specs/ocr-processing/spec.md
@@ -0,0 +1,142 @@
+## MODIFIED Requirements
+
+### Requirement: OCR Track Gap Filling with Raw OCR Regions
+
+The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
+
+#### Scenario: Gap filling activates when coverage is low
+- **GIVEN** an OCR track processing task
+- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
+- **THEN** the system SHALL activate gap filling
+- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
+- **AND** supplement these regions as TEXT elements in the output
+
+#### Scenario: Coverage is determined by IoA (Intersection over Area)
+- **GIVEN** a Raw OCR text region with a bounding box
+- **WHEN** checking if the region is covered by PP-StructureV3
+- **THEN** the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
+- **AND** IoA SHALL be used instead of IoU because it correctly measures the "small box contained in large box" relationship
+- **AND** regions not meeting the IoA criterion SHALL be marked as uncovered
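The IoA criterion in the scenario above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `(x0, y0, x1, y1)` box layout and the helper names `area`, `intersection_area`, `ioa`, and `iou` are assumptions made for the example.

```python
from typing import Tuple

# Assumed box layout: (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
BBox = Tuple[float, float, float, float]


def area(box: BBox) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two boxes (zero if they are disjoint)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Intersection over Area: overlap divided by the OCR box's own area."""
    ocr_area = area(ocr_box)
    if ocr_area == 0.0:
        return 0.0
    return intersection_area(ocr_box, element_box) / ocr_area


def iou(a: BBox, b: BBox) -> float:
    """Intersection over Union, shown only for comparison with IoA."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0.0 else 0.0
```

For a small OCR box fully inside a large layout element, IoA is 1.0 while IoU stays close to zero, which is why IoU would wrongly report such a region as uncovered.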
+
+#### Scenario: Element-type-specific IoA thresholds are applied
+- **GIVEN** a Raw OCR region being evaluated for coverage
+- **WHEN** comparing against PP-StructureV3 elements of different types
+- **THEN** the system SHALL apply different IoA thresholds:
+  - TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
+  - TABLE: IoA > 0.1 (low threshold, so even slight overlap with a table counts as covered; this strict filtering preserves table structure)
+  - FIGURE, IMAGE: IoA > 0.8 (preserves text within figures, such as axis labels)
+- **AND** a region is considered covered if it meets the threshold for ANY overlapping element
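A rough sketch of the per-type check, assuming elements arrive as `(type, bbox)` pairs. The threshold values come from the scenario above; the names `IOA_THRESHOLDS`, `DEFAULT_IOA_THRESHOLD`, and `is_covered` are hypothetical, and the fallback for unlisted element types is an assumption, since the spec does not define one.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)

# Thresholds taken from the scenario text.
IOA_THRESHOLDS: Dict[str, float] = {
    "TEXT": 0.6, "TITLE": 0.6, "HEADER": 0.6, "FOOTER": 0.6,
    "TABLE": 0.1,
    "FIGURE": 0.8, "IMAGE": 0.8,
}
# Assumption: element types not listed in the spec fall back to the TEXT value.
DEFAULT_IOA_THRESHOLD = 0.6


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Overlap area divided by the OCR box area."""
    w = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    h = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    ocr_area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    if w <= 0 or h <= 0 or ocr_area <= 0:
        return 0.0
    return (w * h) / ocr_area


def is_covered(ocr_box: BBox, elements: List[Tuple[str, BBox]]) -> bool:
    """True if the OCR box exceeds the threshold for ANY overlapping element."""
    return any(
        ioa(ocr_box, box) > IOA_THRESHOLDS.get(kind, DEFAULT_IOA_THRESHOLD)
        for kind, box in elements
    )
```

With these values, an OCR box that overlaps a TABLE by 20% counts as covered, while the same overlap with a TEXT element does not.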
+
+#### Scenario: Only TEXT elements are supplemented
+- **GIVEN** uncovered Raw OCR regions identified for supplementation
+- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
+- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
+- **AND** the system SHALL supplement regions only as TEXT type to preserve structural integrity
+
+#### Scenario: Supplemented regions meet confidence threshold
+- **GIVEN** Raw OCR regions to be supplemented
+- **WHEN** a region has confidence score below 0.3
+- **THEN** the system SHALL skip that region
+- **AND** only supplement regions with confidence >= 0.3
+
+#### Scenario: Deduplication uses IoA instead of IoU
+- **GIVEN** a Raw OCR region being considered for supplementation
+- **WHEN** the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element
+- **THEN** the system SHALL skip that region to prevent duplicate text
+- **AND** the original PP-StructureV3 element SHALL be preserved
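The confidence and deduplication scenarios can be combined into one filtering pass. The sketch below assumes a simple `OcrRegion` record and a list of existing TEXT element boxes; `OcrRegion` and `select_supplements` are illustrative names, not identifiers from the codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


@dataclass
class OcrRegion:
    """Hypothetical stand-in for a Raw OCR text region."""
    text: str
    bbox: BBox
    confidence: float


def ioa(ocr_box: BBox, element_box: BBox) -> float:
    """Overlap area divided by the OCR box area."""
    w = min(ocr_box[2], element_box[2]) - max(ocr_box[0], element_box[0])
    h = min(ocr_box[3], element_box[3]) - max(ocr_box[1], element_box[1])
    ocr_area = (ocr_box[2] - ocr_box[0]) * (ocr_box[3] - ocr_box[1])
    if w <= 0 or h <= 0 or ocr_area <= 0:
        return 0.0
    return (w * h) / ocr_area


def select_supplements(
    uncovered: List[OcrRegion],
    text_element_boxes: List[BBox],
    min_confidence: float = 0.3,
    dedup_ioa: float = 0.5,
) -> List[OcrRegion]:
    """Keep regions that pass the confidence gate and do not duplicate TEXT."""
    kept = []
    for region in uncovered:
        if region.confidence < min_confidence:
            continue  # below the confidence threshold: skip
        if any(ioa(region.bbox, box) > dedup_ioa for box in text_element_boxes):
            continue  # likely duplicate of an existing TEXT element: skip
        kept.append(region)
    return kept
```

The deduplication threshold (0.5) is lower than the TEXT coverage threshold (0.6), so a region with IoA between the two is reported as uncovered but is still dropped here as a probable duplicate.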
+
+#### Scenario: Reading order is recalculated after gap filling
+- **GIVEN** supplemented elements have been added to the page
+- **WHEN** assembling the final element list
+- **THEN** the system SHALL recalculate reading order for the entire page
+- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
+- **AND** ensure logical document flow is maintained
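A minimal sketch of the sort described above, assuming each element exposes a `bbox` in `(x0, y0, x1, y1)` form and a writable `reading_order` field. The `Element` class and `recalculate_reading_order` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical page element with just the fields this example needs."""
    text: str
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    reading_order: int = -1


def recalculate_reading_order(elements: List[Element]) -> List[Element]:
    """Sort by y0 (top to bottom), then x0 (left to right), and renumber."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    for index, element in enumerate(ordered):
        element.reading_order = index
    return ordered
```

A plain `(y0, x0)` sort is sensitive to small vertical offsets between boxes on the same visual line, so an implementation may want to group boxes into lines first; the scenario itself only requires the ordering shown here.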
+
+#### Scenario: Coordinate alignment with ocr_dimensions
+- **GIVEN** Raw OCR processing may involve image resizing
+- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
+- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
+- **AND** ensure both sources reference the same coordinate space
+- **AND** prevent coverage misdetection due to scale differences
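One way to bring both sources into the same coordinate space is to rescale Raw OCR boxes before comparison. The sketch assumes `ocr_dimensions` yields the `(width, height)` of the image the OCR ran on; that shape, and the `scale_bbox` helper, are assumptions, since the spec does not define them.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
Size = Tuple[float, float]                # assumed (width, height)


def scale_bbox(bbox: BBox, from_size: Size, to_size: Size) -> BBox:
    """Map a box from one image size to another by per-axis scaling."""
    if from_size[0] <= 0 or from_size[1] <= 0:
        raise ValueError("from_size must have positive width and height")
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    return (bbox[0] * sx, bbox[1] * sy, bbox[2] * sx, bbox[3] * sy)
```

For example, a box detected on a 1000x2000 resized image maps to doubled coordinates on the 2000x4000 original, after which IoA against PP-StructureV3 boxes is meaningful.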
+
+#### Scenario: Supplemented elements have complete metadata
+- **GIVEN** a Raw OCR region being added as a supplemented element
+- **WHEN** creating the DocumentElement
+- **THEN** the element SHALL include page_number
+- **AND** include the confidence score from Raw OCR
+- **AND** include the original bbox coordinates
+- **AND** optionally include a source indicator for debugging
+
+### Requirement: Gap Filling Configuration
+
+The system SHALL provide configurable parameters for gap filling behavior.
+
+#### Scenario: Gap filling can be disabled via configuration
+- **GIVEN** gap_filling_enabled is set to false in configuration
+- **WHEN** OCR track processing runs
+- **THEN** the system SHALL skip all gap filling logic
+- **AND** output only PP-StructureV3 results as before
+
+#### Scenario: Coverage threshold is configurable
+- **GIVEN** gap_filling_coverage_threshold is set to 0.8
+- **WHEN** PP-StructureV3 coverage is 75% (below the configured 80% threshold)
+- **THEN** the system SHALL activate gap filling
+- **AND** supplement uncovered regions
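The activation decision in the two scenarios above reduces to a flag check and a ratio comparison. This is a sketch under assumptions: `coverage_ratio` and `should_activate_gap_filling` are hypothetical helpers, the parameters mirror `gap_filling_enabled` and `gap_filling_coverage_threshold`, and treating a page with no OCR regions as fully covered is a choice made for the example.

```python
from typing import Sequence


def coverage_ratio(covered_flags: Sequence[bool]) -> float:
    """Fraction of Raw OCR regions that are covered by some layout element."""
    if not covered_flags:
        return 1.0  # assumption: nothing to cover means nothing is missing
    return sum(covered_flags) / len(covered_flags)


def should_activate_gap_filling(
    ratio: float,
    enabled: bool = True,
    coverage_threshold: float = 0.7,
) -> bool:
    """Gap filling runs only when enabled and coverage is below the threshold."""
    return enabled and ratio < coverage_threshold
```

With the threshold raised to 0.8, a page at 75% coverage triggers gap filling; with the 0.7 value from the first requirement it does not.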
+
+#### Scenario: IoA thresholds are configurable per element type
+- **GIVEN** custom IoA thresholds configured:
+  - gap_filling_ioa_threshold_text: 0.6
+  - gap_filling_ioa_threshold_table: 0.1
+  - gap_filling_ioa_threshold_figure: 0.8
+  - gap_filling_dedup_ioa_threshold: 0.5
+- **WHEN** evaluating coverage and deduplication
+- **THEN** the system SHALL use the configured values
+- **AND** apply them consistently throughout the gap filling process
+
+#### Scenario: Confidence threshold is configurable
+- **GIVEN** gap_filling_confidence_threshold is set to 0.5
+- **WHEN** supplementing Raw OCR regions
+- **THEN** the system SHALL only include regions with confidence >= 0.5
+- **AND** filter out lower confidence regions
+
+#### Scenario: Boundary shrinking reduces edge duplicates
+- **GIVEN** gap_filling_shrink_pixels is set to 1
+- **WHEN** evaluating coverage with IoA
+- **THEN** the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side
+- **AND** this reduces false "uncovered" detection at region boundaries
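Boundary shrinking can be sketched as a small box transform applied before the IoA check. The `shrink_box` name and the handling of boxes too small to shrink (collapsing to the centre line rather than inverting) are assumptions; the spec only states the inward shrink per side.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def shrink_box(bbox: BBox, pixels: float) -> BBox:
    """Shrink a box inward by `pixels` on each side without inverting it."""
    x0, y0, x1, y1 = bbox
    nx0, nx1 = x0 + pixels, x1 - pixels
    ny0, ny1 = y0 + pixels, y1 - pixels
    if nx0 > nx1:  # box narrower than 2 * pixels: collapse to its centre
        nx0 = nx1 = (x0 + x1) / 2
    if ny0 > ny1:  # box shorter than 2 * pixels: collapse to its centre
        ny0 = ny1 = (y0 + y1) / 2
    return (nx0, ny0, nx1, ny1)
```

An OCR box that overhangs its containing element by a pixel on each edge then falls fully inside the element, which raises its IoA and avoids a spurious "uncovered" result.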
+
+## ADDED Requirements
+
+### Requirement: Use PP-StructureV3 Internal OCR Results
+
+The system SHALL preferentially use PP-StructureV3's internal OCR results (`overall_ocr_res`) instead of running a separate Raw OCR inference.
+
+#### Scenario: Extract overall_ocr_res from PP-StructureV3
+- **GIVEN** PP-StructureV3 processing completes
+- **WHEN** the result contains `json['res']['overall_ocr_res']`
+- **THEN** the system SHALL extract OCR regions from:
+  - `dt_polys`: detection box polygons
+  - `rec_texts`: recognized text strings
+  - `rec_scores`: confidence scores
+- **AND** convert these to the standard TextRegion format for gap filling
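The conversion step might look like the sketch below. It operates on a plain dictionary shaped like `json['res']['overall_ocr_res']` as named in the scenario; the `TextRegion` dataclass shown here is a stand-in for the project's own type, and treating each `dt_polys` entry as a sequence of `[x, y]` points is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class TextRegion:
    """Stand-in for the project's TextRegion; the real fields may differ."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    confidence: float


def regions_from_overall_ocr(result: Dict[str, Any]) -> List[TextRegion]:
    """Build TextRegion objects from dt_polys, rec_texts and rec_scores."""
    ocr = result.get("res", {}).get("overall_ocr_res")
    if not ocr:
        return []
    regions = []
    triples = zip(
        ocr.get("dt_polys", []),
        ocr.get("rec_texts", []),
        ocr.get("rec_scores", []),
    )
    for poly, text, score in triples:
        # Reduce the polygon to its axis-aligned bounding box.
        xs = [float(point[0]) for point in poly]
        ys = [float(point[1]) for point in poly]
        bbox = (min(xs), min(ys), max(xs), max(ys))
        regions.append(TextRegion(text=text, bbox=bbox, confidence=float(score)))
    return regions
```

Reducing each polygon to an axis-aligned box loses rotation information, which is acceptable for IoA checks against rectangular layout elements but may not suit skewed text.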
+
+#### Scenario: Skip separate Raw OCR when overall_ocr_res is available
+- **GIVEN** gap_filling_use_overall_ocr is true (default)
+- **WHEN** PP-StructureV3 result contains overall_ocr_res
+- **THEN** the system SHALL NOT execute separate PaddleOCR inference
+- **AND** use the extracted overall_ocr_res as the OCR source
+- **AND** this reduces total inference time by approximately 50%
+
+#### Scenario: Fallback to separate Raw OCR when needed
+- **GIVEN** gap_filling_use_overall_ocr is false OR overall_ocr_res is missing
+- **WHEN** gap filling is activated
+- **THEN** the system SHALL execute separate PaddleOCR inference as before
+- **AND** use the separate OCR results for gap filling
+- **AND** this maintains backward compatibility
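The skip-or-fallback behaviour in the two scenarios above can be expressed as a single selection function. In this sketch, `pick_ocr_source` is a hypothetical helper and `run_raw_ocr` is a caller-supplied callable standing in for the separate PaddleOCR inference; the callable is invoked only on the fallback path.

```python
from typing import Any, Callable, Dict, Tuple


def pick_ocr_source(
    result: Dict[str, Any],
    use_overall_ocr: bool,
    run_raw_ocr: Callable[[], Any],
) -> Tuple[str, Any]:
    """Prefer overall_ocr_res when allowed and present; otherwise run Raw OCR."""
    overall = None
    if use_overall_ocr:
        overall = result.get("res", {}).get("overall_ocr_res")
    if overall:
        return "overall_ocr_res", overall
    # Fallback path: the flag is off or the field is missing.
    return "raw_ocr", run_raw_ocr()
```

Returning a source label alongside the data makes it straightforward to record which path produced the regions, for instance in the optional source indicator on supplemented elements.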
+
+#### Scenario: Coordinate consistency is guaranteed
+- **GIVEN** overall_ocr_res is extracted from PP-StructureV3
+- **WHEN** comparing with PP-StructureV3 layout elements
+- **THEN** both SHALL use the same coordinate system
+- **AND** no additional coordinate alignment is needed
+- **AND** this prevents scale mismatch issues