chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/improve-ocr-track-algorithm/proposal.md
+++ b/openspec/changes/improve-ocr-track-algorithm/proposal.md
@@ -0,0 +1,49 @@
+# Change: Improve OCR Track Algorithm Based on PP-StructureV3 Best Practices
+
+## Why
+
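+The OCR Track's Gap Filling algorithm currently uses **IoU (Intersection over Union)** to decide whether an OCR text region is covered by a Layout region. The official PaddleX documentation (paddle_review.md) recommends **IoA (Intersection over Area)** instead, because it correctly captures the asymmetric "small box contained in a large box" relationship. In addition, a single threshold is currently applied to all element types, whereas different types call for different threshold strategies.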
+
+## What Changes
+
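+1. **IoU → IoA algorithm change**: Switch the coverage check in `gap_filling_service.py` from IoU to IoA
+2. **Dynamic threshold strategy**: Use a different IoA threshold for each element type (TEXT, TABLE, FIGURE)
+3. **Use PP-StructureV3's built-in OCR**: Use `overall_ocr_res` instead of running a separate Raw OCR pass, which saves inference time and keeps coordinates consistent
+4. **Boundary shrinking**: Shrink OCR boxes inward by 1-2 px to avoid duplicate rendering at edges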
+
+## Impact
+
+- Affected specs: `ocr-processing`
+- Affected code:
+  - `backend/app/services/gap_filling_service.py` - core algorithm change
+  - `backend/app/services/ocr_service.py` - switch to `overall_ocr_res`
+  - `backend/app/services/processing_orchestrator.py` - adjust the OCR data source
+  - `backend/app/core/config.py` - add per-element-type threshold settings
+
+## Technical Details
+
+### 1. IoA vs IoU
+
+```
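+IoU = intersection area / union area    (symmetric; tests whether two boxes refer to the same object)
+IoA = intersection area / OCR box area  (asymmetric; tests whether a small box is contained in a large box)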
+```
+
+When the Layout box is much larger than the OCR box, IoU becomes very small, so the region is wrongly judged as "not covered".
+
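+### 2. Recommended Dynamic Thresholds
+
+| Element type | IoA threshold | Notes |
+|--------------|---------------|-------|
+| TEXT/TITLE | 0.6 | Tolerates boundary errors |
+| TABLE | 0.1 | Strict filtering to avoid breaking table structure |
+| FIGURE | 0.8 | Preserves text inside figures (e.g. axis labels) |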
+
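+### 3. overall_ocr_res Verification Results
+
+Confirmed that PP-StructureV3's `json['res']['overall_ocr_res']` contains:
+- `dt_polys`: detection box coordinates (polygon format)
+- `rec_texts`: recognized text
+- `rec_scores`: recognition confidence scores
+
+Testing shows the same number of regions as a separate Raw OCR run (59 regions), so it can be substituted safely.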
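The IoU/IoA distinction and the per-type thresholds in the proposal above can be illustrated with a minimal Python sketch. This is not the code in `gap_filling_service.py`: the function names, the `(x0, y0, x1, y1)` box layout, the `IOA_THRESHOLDS` dict, and the 0.6 fallback for unlisted element types are all assumptions made for illustration. Only the threshold values come from the proposal and spec.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout

# Values from the proposal/spec; the dict itself is a hypothetical structure.
IOA_THRESHOLDS = {
    "TEXT": 0.6, "TITLE": 0.6, "HEADER": 0.6, "FOOTER": 0.6,
    "TABLE": 0.1,
    "FIGURE": 0.8, "IMAGE": 0.8,
}
DEFAULT_THRESHOLD = 0.6  # assumption: fallback for element types not listed


def area(b: BBox) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: BBox, b: BBox) -> float:
    """Area of the overlap between two boxes (zero if they do not overlap)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def iou(a: BBox, b: BBox) -> float:
    """Symmetric: intersection area / union area."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def ioa(ocr_box: BBox, layout_box: BBox) -> float:
    """Asymmetric: intersection area / OCR box area."""
    ocr_area = area(ocr_box)
    return intersection(ocr_box, layout_box) / ocr_area if ocr_area > 0 else 0.0


def is_covered(ocr_box: BBox, elements: Iterable[Tuple[str, BBox]]) -> bool:
    """Covered if IoA exceeds the type-specific threshold for ANY element."""
    return any(
        ioa(ocr_box, bbox) > IOA_THRESHOLDS.get(etype, DEFAULT_THRESHOLD)
        for etype, bbox in elements
    )


# A small OCR box fully inside a much larger layout box.
ocr_box = (10, 10, 30, 20)      # area 200
layout_box = (0, 0, 100, 100)   # area 10000
iou_value = iou(ocr_box, layout_box)  # 200 / 10000
ioa_value = ioa(ocr_box, layout_box)  # 200 / 200
```

In this example the OCR box lies entirely inside the layout box, yet IoU is only 0.02 while IoA is 1.0, which is the misclassification the proposal describes.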
--- a/openspec/changes/improve-ocr-track-algorithm/specs/ocr-processing/spec.md
+++ b/openspec/changes/improve-ocr-track-algorithm/specs/ocr-processing/spec.md
@@ -0,0 +1,142 @@
+## MODIFIED Requirements
+
+### Requirement: OCR Track Gap Filling with Raw OCR Regions
+
+The system SHALL detect and fill gaps in PP-StructureV3 output by supplementing with Raw OCR text regions when significant content loss is detected.
+
+#### Scenario: Gap filling activates when coverage is low
+- **GIVEN** an OCR track processing task
+- **WHEN** PP-StructureV3 outputs elements that cover less than 70% of Raw OCR text regions
+- **THEN** the system SHALL activate gap filling
+- **AND** identify Raw OCR regions not covered by any PP-StructureV3 element
+- **AND** supplement these regions as TEXT elements in the output
+
+#### Scenario: Coverage is determined by IoA (Intersection over Area)
+- **GIVEN** a Raw OCR text region with bounding box
+- **WHEN** checking if the region is covered by PP-StructureV3
+- **THEN** the region SHALL be considered covered if IoA (intersection area / OCR box area) exceeds the type-specific threshold
+- **AND** IoA SHALL be used instead of IoU because it correctly measures the "small box contained in large box" relationship
+- **AND** regions not meeting the IoA criterion SHALL be marked as uncovered
+
+#### Scenario: Element-type-specific IoA thresholds are applied
+- **GIVEN** a Raw OCR region being evaluated for coverage
+- **WHEN** comparing against PP-StructureV3 elements of different types
+- **THEN** the system SHALL apply different IoA thresholds:
+  - TEXT, TITLE, HEADER, FOOTER: IoA > 0.6 (tolerates boundary errors)
+  - TABLE: IoA > 0.1 (strict filtering to preserve table structure)
+  - FIGURE, IMAGE: IoA > 0.8 (preserves text within figures like axis labels)
+- **AND** a region is considered covered if it meets the threshold for ANY overlapping element
+
+#### Scenario: Only TEXT elements are supplemented
+- **GIVEN** uncovered Raw OCR regions identified for supplementation
+- **WHEN** PP-StructureV3 has detected TABLE, IMAGE, FIGURE, FLOWCHART, HEADER, or FOOTER elements
+- **THEN** the system SHALL NOT supplement regions that overlap with these structural elements
+- **AND** only supplement regions as TEXT type to preserve structural integrity
+
+#### Scenario: Supplemented regions meet confidence threshold
+- **GIVEN** Raw OCR regions to be supplemented
+- **WHEN** a region has confidence score below 0.3
+- **THEN** the system SHALL skip that region
+- **AND** only supplement regions with confidence >= 0.3
+
+#### Scenario: Deduplication uses IoA instead of IoU
+- **GIVEN** a Raw OCR region being considered for supplementation
+- **WHEN** the region has IoA > 0.5 with any existing PP-StructureV3 TEXT element
+- **THEN** the system SHALL skip that region to prevent duplicate text
+- **AND** the original PP-StructureV3 element SHALL be preserved
+
+#### Scenario: Reading order is recalculated after gap filling
+- **GIVEN** supplemented elements have been added to the page
+- **WHEN** assembling the final element list
+- **THEN** the system SHALL recalculate reading order for the entire page
+- **AND** sort elements by y0 coordinate (top to bottom) then x0 (left to right)
+- **AND** ensure logical document flow is maintained
+
+#### Scenario: Coordinate alignment with ocr_dimensions
+- **GIVEN** Raw OCR processing may involve image resizing
+- **WHEN** comparing Raw OCR bbox with PP-StructureV3 bbox
+- **THEN** the system SHALL use ocr_dimensions to normalize coordinates
+- **AND** ensure both sources reference the same coordinate space
+- **AND** prevent coverage misdetection due to scale differences
+
+#### Scenario: Supplemented elements have complete metadata
+- **GIVEN** a Raw OCR region being added as supplemented element
+- **WHEN** creating the DocumentElement
+- **THEN** the element SHALL include page_number
+- **AND** include confidence score from Raw OCR
+- **AND** include original bbox coordinates
+- **AND** optionally include source indicator for debugging
+
+### Requirement: Gap Filling Configuration
+
+The system SHALL provide configurable parameters for gap filling behavior.
+
+#### Scenario: Gap filling can be disabled via configuration
+- **GIVEN** gap_filling_enabled is set to false in configuration
+- **WHEN** OCR track processing runs
+- **THEN** the system SHALL skip all gap filling logic
+- **AND** output only PP-StructureV3 results as before
+
+#### Scenario: Coverage threshold is configurable
+- **GIVEN** gap_filling_coverage_threshold is set to 0.8
+- **WHEN** PP-StructureV3 coverage is 75%
+- **THEN** the system SHALL activate gap filling
+- **AND** supplement uncovered regions
+
+#### Scenario: IoA thresholds are configurable per element type
+- **GIVEN** custom IoA thresholds configured:
+  - gap_filling_ioa_threshold_text: 0.6
+  - gap_filling_ioa_threshold_table: 0.1
+  - gap_filling_ioa_threshold_figure: 0.8
+  - gap_filling_dedup_ioa_threshold: 0.5
+- **WHEN** evaluating coverage and deduplication
+- **THEN** the system SHALL use the configured values
+- **AND** apply them consistently throughout gap filling process
+
+#### Scenario: Confidence threshold is configurable
+- **GIVEN** gap_filling_confidence_threshold is set to 0.5
+- **WHEN** supplementing Raw OCR regions
+- **THEN** the system SHALL only include regions with confidence >= 0.5
+- **AND** filter out lower confidence regions
+
+#### Scenario: Boundary shrinking reduces edge duplicates
+- **GIVEN** gap_filling_shrink_pixels is set to 1
+- **WHEN** evaluating coverage with IoA
+- **THEN** the system SHALL shrink OCR bounding boxes inward by 1 pixel on each side
+- **AND** this reduces false "uncovered" detection at region boundaries
+
+## ADDED Requirements
+
+### Requirement: Use PP-StructureV3 Internal OCR Results
+
+The system SHALL preferentially use PP-StructureV3's internal OCR results (`overall_ocr_res`) instead of running a separate Raw OCR inference.
+
+#### Scenario: Extract overall_ocr_res from PP-StructureV3
+- **GIVEN** PP-StructureV3 processing completes
+- **WHEN** the result contains `json['res']['overall_ocr_res']`
+- **THEN** the system SHALL extract OCR regions from:
+  - `dt_polys`: detection box polygons
+  - `rec_texts`: recognized text strings
+  - `rec_scores`: confidence scores
+- **AND** convert these to the standard TextRegion format for gap filling
+
+#### Scenario: Skip separate Raw OCR when overall_ocr_res is available
+- **GIVEN** gap_filling_use_overall_ocr is true (default)
+- **WHEN** PP-StructureV3 result contains overall_ocr_res
+- **THEN** the system SHALL NOT execute separate PaddleOCR inference
+- **AND** use the extracted overall_ocr_res as the OCR source
+- **AND** this reduces total inference time by approximately 50%
+
+#### Scenario: Fallback to separate Raw OCR when needed
+- **GIVEN** gap_filling_use_overall_ocr is false OR overall_ocr_res is missing
+- **WHEN** gap filling is activated
+- **THEN** the system SHALL execute separate PaddleOCR inference as before
+- **AND** use the separate OCR results for gap filling
+- **AND** this maintains backward compatibility
+
+#### Scenario: Coordinate consistency is guaranteed
+- **GIVEN** overall_ocr_res is extracted from PP-StructureV3
+- **WHEN** comparing with PP-StructureV3 layout elements
+- **THEN** both SHALL use the same coordinate system
+- **AND** no additional coordinate alignment is needed
+- **AND** this prevents scale mismatch issues
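The `overall_ocr_res` extraction and boundary-shrinking scenarios above can be sketched as follows. This is a hedged illustration, not the project's implementation: the helper names, the plain-dict region format standing in for `TextRegion`, and the choice to leave very small boxes unshrunk are assumptions. Only the keys `dt_polys`, `rec_texts`, `rec_scores` and the default values 0.3 and 1 px come from the spec. Applying the confidence filter at extraction time is a simplification; the spec applies it when supplementing regions.

```python
from typing import Any, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def poly_to_bbox(poly: Sequence[Sequence[float]]) -> BBox:
    """Axis-aligned bounding box around a polygon given as [x, y] points."""
    xs = [float(p[0]) for p in poly]
    ys = [float(p[1]) for p in poly]
    return (min(xs), min(ys), max(xs), max(ys))


def shrink_bbox(bbox: BBox, pixels: float = 1) -> BBox:
    """Shrink a box inward by `pixels` on each side.

    Assumption: boxes too small to shrink are returned unchanged so that
    they never collapse to zero or negative size.
    """
    x0, y0, x1, y1 = bbox
    if x1 - x0 <= 2 * pixels or y1 - y0 <= 2 * pixels:
        return bbox
    return (x0 + pixels, y0 + pixels, x1 - pixels, y1 - pixels)


def extract_regions(
    overall_ocr_res: Dict[str, Any], min_confidence: float = 0.3
) -> List[Dict[str, Any]]:
    """Convert dt_polys / rec_texts / rec_scores into simple region dicts."""
    polys = overall_ocr_res.get("dt_polys", [])
    texts = overall_ocr_res.get("rec_texts", [])
    scores = overall_ocr_res.get("rec_scores", [])
    regions = []
    for poly, text, score in zip(polys, texts, scores):
        if score < min_confidence:
            continue  # skip low-confidence regions
        regions.append(
            {"text": text, "confidence": float(score), "bbox": poly_to_bbox(poly)}
        )
    return regions


# Hypothetical sample shaped like the fields listed in the spec.
sample = {
    "dt_polys": [
        [[10, 10], [30, 10], [30, 20], [10, 20]],
        [[0, 0], [5, 0], [5, 5], [0, 5]],
    ],
    "rec_texts": ["hello", "??"],
    "rec_scores": [0.95, 0.2],
}
regions = extract_regions(sample)
shrunk = shrink_bbox(regions[0]["bbox"], pixels=1)
```

Because these regions come from the same PP-StructureV3 result as the layout elements, the sketch performs no coordinate rescaling, in line with the "Coordinate consistency is guaranteed" scenario.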
--- a/openspec/changes/improve-ocr-track-algorithm/tasks.md
+++ b/openspec/changes/improve-ocr-track-algorithm/tasks.md
@@ -0,0 +1,54 @@
+## 1. Algorithm Changes (gap_filling_service.py)
+
+### 1.1 IoA Implementation
+- [x] 1.1.1 Add `_calculate_ioa()` method alongside existing `_calculate_iou()`
+- [x] 1.1.2 Modify `_is_region_covered()` to use IoA instead of IoU
+- [x] 1.1.3 Update deduplication logic to use IoA
+
+### 1.2 Dynamic Threshold Strategy
+- [x] 1.2.1 Add element-type-specific thresholds as class constants
+- [x] 1.2.2 Modify `_is_region_covered()` to accept element type parameter
+- [x] 1.2.3 Apply different thresholds based on element type (TEXT: 0.6, TABLE: 0.1, FIGURE: 0.8)
+
+### 1.3 Boundary Shrinking
+- [x] 1.3.1 Add optional `shrink_pixels` parameter to coverage detection
+- [x] 1.3.2 Implement bbox shrinking logic (inward 1-2 px)
+
+## 2. OCR Data Source Changes
+
+### 2.1 Extract overall_ocr_res from PP-StructureV3
+- [x] 2.1.1 Modify `pp_structure_enhanced.py` to extract `overall_ocr_res` from result
+- [x] 2.1.2 Convert `dt_polys` + `rec_texts` + `rec_scores` to TextRegion format
+- [x] 2.1.3 Store extracted OCR in result dict for gap filling
+
+### 2.2 Update Processing Orchestrator
+- [x] 2.2.1 Add option to use `overall_ocr_res` as OCR source
+- [x] 2.2.2 Skip separate Raw OCR inference when using PP-StructureV3's OCR
+- [x] 2.2.3 Maintain backward compatibility with explicit Raw OCR mode
+
+## 3. Configuration Updates
+
+### 3.1 Add Settings (config.py)
+- [x] 3.1.1 Add `gap_filling_ioa_threshold_text: float = 0.6`
+- [x] 3.1.2 Add `gap_filling_ioa_threshold_table: float = 0.1`
+- [x] 3.1.3 Add `gap_filling_ioa_threshold_figure: float = 0.8`
+- [x] 3.1.4 Add `gap_filling_use_overall_ocr: bool = True`
+- [x] 3.1.5 Add `gap_filling_shrink_pixels: int = 1`
+
+## 4. Testing
+
+### 4.1 Unit Tests
+- [ ] 4.1.1 Test IoA calculation with known values
+- [ ] 4.1.2 Test dynamic threshold selection by element type
+- [ ] 4.1.3 Test boundary shrinking edge cases
+
+### 4.2 Integration Tests
+- [ ] 4.2.1 Test with scan.pdf (current problematic file)
+- [ ] 4.2.2 Compare results: old IoU vs new IoA approach
+- [ ] 4.2.3 Verify no duplicate text rendering in output PDF
+- [ ] 4.2.4 Verify table content is not duplicated outside table bounds
+
+## 5. Documentation
+
+- [x] 5.1 Update spec documentation with new algorithm
+- [x] 5.2 Add inline code comments explaining IoA vs IoU