feat: add OCR to UnifiedDocument converter for PP-StructureV3 integration

Implements the converter that transforms PP-StructureV3 OCR results into
the UnifiedDocument format, enabling consistent output for both OCR and
direct extraction tracks.

- Create OCRToUnifiedConverter class with full element type mapping
- Handle both enhanced (parsing_res_list) and standard markdown results
- Support 4-point and simple bbox formats for coordinates
- Establish element relationships (captions, lists, headers)
- Integrate converter into OCR service dual-track processing
- Update tasks.md, marking sections 3.2 and 3.3 complete
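
As an illustration of the bbox bullet above, here is a minimal sketch of how the two coordinate formats could be normalized into one shape. The function name `normalize_bbox` and the exact input layouts (four `[x, y]` points versus a flat `[x0, y0, x1, y1]` list) are assumptions for this example and are not taken from the converter's code.

```python
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]


def normalize_bbox(raw: Sequence) -> BBox:
    """Return (x0, y0, x1, y1) from a 4-point polygon or a flat box.

    Hypothetical helper: the input layouts handled here are assumed,
    not confirmed against the PP-StructureV3 output.
    """
    if len(raw) == 4 and all(isinstance(p, (list, tuple)) for p in raw):
        # 4-point form: [[x, y], [x, y], [x, y], [x, y]]
        xs = [float(p[0]) for p in raw]
        ys = [float(p[1]) for p in raw]
        return (min(xs), min(ys), max(xs), max(ys))
    if len(raw) == 4:
        # Simple form: [x0, y0, x1, y1]; reorder in case corners are swapped
        x0, y0, x1, y1 = (float(v) for v in raw)
        return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
    raise ValueError(f"unsupported bbox format: {raw!r}")
```

Collapsing a polygon to its axis-aligned bounds loses rotation information, so a real converter may want to keep the original points in metadata alongside the normalized box.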

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-19 08:05:20 +08:00
parent 062cb1f423
commit a3a6fbe58b
4 changed files with 1172 additions and 29 deletions

@@ -42,15 +42,15 @@
 - [ ] 3.1.2 Enable batch processing for GPU efficiency
 - [ ] 3.1.3 Configure memory management settings
 - [ ] 3.1.4 Set up model caching
-- [ ] 3.2 Enhance OCR service to use parsing_res_list
-- [ ] 3.2.1 Replace markdown extraction with parsing_res_list
-- [ ] 3.2.2 Extract all 23 element types
-- [ ] 3.2.3 Preserve bbox coordinates from PP-StructureV3
-- [ ] 3.2.4 Maintain reading order information
-- [ ] 3.3 Create OCR to UnifiedDocument converter
-- [ ] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
-- [ ] 3.3.2 Handle complex nested structures
-- [ ] 3.3.3 Preserve all metadata
+- [x] 3.2 Enhance OCR service to use parsing_res_list
+- [x] 3.2.1 Replace markdown extraction with parsing_res_list
+- [x] 3.2.2 Extract all 23 element types
+- [x] 3.2.3 Preserve bbox coordinates from PP-StructureV3
+- [x] 3.2.4 Maintain reading order information
+- [x] 3.3 Create OCR to UnifiedDocument converter
+- [x] 3.3.1 Map PP-StructureV3 elements to UnifiedDocument
+- [x] 3.3.2 Handle complex nested structures
+- [x] 3.3.3 Preserve all metadata
 ## 4. Unified Processing Pipeline
 - [x] 4.1 Update main OCR service for dual-track processing