fix: OCR track table data format and image cropping

Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays

Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern

Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)

Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
docs/architecture-overview.md
@@ -0,0 +1,84 @@
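# Tool_OCR Architecture Overview and UML

This document gives an overview of Tool_OCR's main components, data flow, and dual-track processing (OCR / Direct), and includes a UML relationship diagram to help assess the impact of a change.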
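
## System Layers and Key Components
- **API layer (FastAPI)**: `app/main.py` starts the application, mounts the routers (`routers/auth.py`, `routers/tasks.py`, `routers/admin.py`), and initializes memory management, the service pool, and concurrency control in the lifespan handler.
- **Task/file management**: `task_service.py` and `file_access_service.py` handle task CRUD, paths, and permissions; the `Task` / `TaskFile` models record result file paths.
- **Core processing service**: `OCRService` (`services/ocr_service.py`) handles dual-track routing and OCR; it integrates detection, direct extraction, OCR, unified-format conversion, export, and PDF generation.
- **Dual-track detection / direct extraction**: `DocumentTypeDetector` decides between the Direct and OCR tracks; `DirectExtractionEngine` uses PyMuPDF to extract text, tables, and images directly (triggering hybrid mode to recover missing images when necessary).
- **OCR parsing**: PaddleOCR + `PPStructureEnhanced` extract 23 element types; `OCRToUnifiedConverter` converts the result to the `UnifiedDocument` unified format.
- **Export / presentation**: `UnifiedDocumentExporter` produces JSON/Markdown; `pdf_generator_service.py` generates a layout-preserving PDF; the frontend fetches them via `/api/v2/tasks/{id}/download/*`.
- **Resource control**: `memory_manager.py` (MemoryGuard, prediction semaphore, model lifecycle) and `service_pool.py` (the `OCRService` pool) prevent duplicate model loading and GPU memory exhaustion.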
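
The conversion step in the OCR parsing bullet can be illustrated with a minimal sketch of the value-based `ElementType` lookup and the content-based table reclassification named in this commit's message. The enum members shown, the helper names `to_element_type` and `classify`, and the fallback to `TEXT` are assumptions for illustration, not the actual code in `ocr_to_unified_converter.py`.

```python
from enum import Enum


class ElementType(Enum):
    # Hypothetical subset; the document mentions 23 element types in total.
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"


def to_element_type(raw: str) -> ElementType:
    """Look up an ElementType by its value; fall back to TEXT (assumed default)."""
    try:
        return ElementType(raw.strip().lower())
    except ValueError:
        return ElementType.TEXT


def classify(raw_type: str, content: str) -> ElementType:
    """Reclassify a TEXT element as TABLE when its content contains an HTML table."""
    element_type = to_element_type(raw_type)
    if element_type is ElementType.TEXT and "<table" in content.lower():
        return ElementType.TABLE
    return element_type
```

Looking up by value (`ElementType("table")`) rather than by member name keeps the conversion independent of how the enum members happen to be spelled.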

## Processing Flow (Task Level)
1. **Upload**: `POST /api/v2/upload` creates a Task and writes the file to `uploads/` (with SHA256 and file info).
2. **Start**: `POST /api/v2/tasks/{id}/start` (`ProcessingOptions`, optionally including `pp_structure_params`) → the background `process_task_ocr` job obtains an `OCRService` from the service pool.
3. **Track decision**: `DocumentTypeDetector.detect` analyzes the MIME type, the PDF text coverage, or the sampled results after Office-to-PDF conversion:
   - **Direct**: `DirectExtractionEngine.extract` produces a `UnifiedDocument`; if missing images are detected, hybrid mode is enabled to call OCR for image extraction or to render inline images.
   - **OCR**: `process_file_traditional` → PaddleOCR + PP-Structure → `OCRToUnifiedConverter.convert` produces a `UnifiedDocument`.
   - `ProcessingTrack` records `ocr` / `direct` / `hybrid`; processing time and statistics are written to metadata.
4. **Output persistence**: `UnifiedDocumentExporter` writes `_result.json` (with metadata and statistics) and `_output.md`; `pdf_generator_service` produces `_layout.pdf`; the paths are written back to the DB.
5. **Download / view**: the frontend fetches files via `/download/json|markdown|pdf|unified`; `/metadata` reads the JSON metadata and returns statistics and `processing_track`.
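
The track decision in step 3 can be sketched roughly as follows. This is an illustrative simplification, not the logic of `DocumentTypeDetector.detect`: the function name `choose_track`, its parameters, and the `0.5` coverage threshold are hypothetical, and only the three track values come from this document.

```python
from enum import Enum


class ProcessingTrack(Enum):
    OCR = "ocr"
    DIRECT = "direct"
    HYBRID = "hybrid"


def choose_track(
    mime_type: str,
    text_coverage: float,
    missing_images: bool = False,
    coverage_threshold: float = 0.5,  # hypothetical value, not from the source
) -> ProcessingTrack:
    """Pick a track from the MIME type and the fraction of pages with extractable text."""
    # Raster images carry no text layer, so they can only go through OCR.
    if mime_type.startswith("image/"):
        return ProcessingTrack.OCR
    # Documents with too little extractable text are treated as scans.
    if text_coverage < coverage_threshold:
        return ProcessingTrack.OCR
    # Direct extraction that is missing images falls back to hybrid mode.
    return ProcessingTrack.HYBRID if missing_images else ProcessingTrack.DIRECT
```

For Office files, this sketch assumes the caller passes the coverage measured after conversion to PDF.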

## Frontend Flow Summary
- `UploadPage`: calls `apiClientV2.uploadFile`; the first `task_id` is stored in `uploadStore.batchId`.
- `ProcessingPage`: calls `startTask` for `batchId` (default `use_dual_track=true`, custom `pp_structure_params` supported) and polls for status.
- `ResultsPage` / `TaskDetailPage`: use `getTask` and `getProcessingMetadata` to display `processing_track` and statistics, and offer JSON/Markdown/PDF/Unified downloads.
- `TaskHistoryPage`: lists tasks and supports restart, retry, and download.
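
## Shared Modules and Impact Points
- **UnifiedDocument** (`models/unified_document.py`) is the output format shared by Direct and OCR; all export, PDF, and frontend track display depend on its fields and metadata.
- **Service pool / memory guard**: Direct and OCR share the same `OCRService` instance pool and MemoryGuard; new resources or changes must follow the acquire/release, cleanup, and semaphore rules.
- **Detection threshold changes**: adjusting `DocumentTypeDetector` parameters changes the split between Direct and OCR, which indirectly changes GPU load and result format.
- **Export / PDF**: any change to the UnifiedDocument structure affects JSON/Markdown/PDF output and frontend download/preview; the converter and exporters must be kept in sync.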
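
The acquire/release and semaphore rules mentioned above could look something like the sketch below. It is a generic pattern under stated assumptions, not the contents of `service_pool.py` or `memory_manager.py`: the `borrowed` helper, the constructor parameters, and the use of `queue.Queue` are all hypothetical.

```python
import threading
from contextlib import contextmanager
from queue import Queue


class ServicePool:
    """Fixed-size pool of pre-built services plus a prediction semaphore."""

    def __init__(self, factory, size=2, max_predictions=1):
        self._idle = Queue()
        for _ in range(size):
            self._idle.put(factory())
        # Limits how many borrowed services may run predictions at once.
        self.prediction_semaphore = threading.Semaphore(max_predictions)

    def acquire(self, timeout=None):
        # Blocks until a service is idle; raises queue.Empty on timeout.
        return self._idle.get(timeout=timeout)

    def release(self, service):
        self._idle.put(service)

    @contextmanager
    def borrowed(self, timeout=None):
        """Acquire a service and always release it, even if the caller raises."""
        service = self.acquire(timeout)
        try:
            with self.prediction_semaphore:
                yield service
        finally:
            self.release(service)
```

Wrapping acquire/release in a context manager is one way to make sure a failed task cannot leak a pooled instance or leave the semaphore held.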

## UML Relationship Diagram (Mermaid)
```mermaid
classDiagram
    class TasksRouter {
        +upload_file()
        +start_task()
        +download_json/markdown/pdf/unified()
        +get_metadata()
    }
    class TaskService {
        +create_task()
        +update_task_status()
        +get_task_by_id()
    }
    class FileAccessService
    class OCRService {
        +process()
        +process_with_dual_track()
        +process_file_traditional()
        +save_results()
    }
    class DocumentTypeDetector {
        +detect()
    }
    class DirectExtractionEngine {
        +extract()
        +check_document_for_missing_images()
    }
    class OCRToUnifiedConverter {
        +convert()
    }
    class UnifiedDocument
    class UnifiedDocumentExporter {
        +export_to_json()
        +export_to_markdown()
    }
    class PDFGeneratorService {
        +generate_layout_pdf()
        +generate_from_unified_document()
    }
    class ServicePool {
        +acquire()
        +release()
    }
    class MemoryManager {
        <<singleton>>
    }
    class OfficeConverter {
        +convert_to_pdf()
    }
    class PPStructureEnhanced {
        +analyze_with_full_structure()
    }

    TasksRouter --> TaskService
    TasksRouter --> FileAccessService
    TasksRouter --> OCRService : background process via process_task_ocr
    OCRService --> DocumentTypeDetector : track recommendation
    OCRService --> DirectExtractionEngine : direct track
    OCRService --> OCRToUnifiedConverter : OCR track result -> UnifiedDocument
    OCRService --> OfficeConverter : Office -> PDF
    OCRService --> PPStructureEnhanced : layout analysis (PP-StructureV3)
    OCRService --> UnifiedDocumentExporter : persist results
    OCRService --> PDFGeneratorService : layout-preserving PDF
    OCRService --> ServicePool : acquired instance
    ServicePool --> MemoryManager : model lifecycle / GPU guard
    UnifiedDocumentExporter --> UnifiedDocument
    PDFGeneratorService --> UnifiedDocument
```
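
One way to use the diagram for impact analysis is to treat its arrows as a dependency graph and walk it backwards. The sketch below copies the edges from the diagram above; the `DEPENDS_ON` table and the `affected_by` helper are illustrative names and are not part of the codebase.

```python
# Edges copied from the class diagram: each key depends on the listed classes.
DEPENDS_ON = {
    "TasksRouter": ["TaskService", "FileAccessService", "OCRService"],
    "OCRService": [
        "DocumentTypeDetector",
        "DirectExtractionEngine",
        "OCRToUnifiedConverter",
        "OfficeConverter",
        "PPStructureEnhanced",
        "UnifiedDocumentExporter",
        "PDFGeneratorService",
        "ServicePool",
    ],
    "ServicePool": ["MemoryManager"],
    "UnifiedDocumentExporter": ["UnifiedDocument"],
    "PDFGeneratorService": ["UnifiedDocument"],
}


def affected_by(changed: str) -> set[str]:
    """Return every class that depends on `changed`, directly or transitively."""
    affected: set[str] = set()
    frontier = [changed]
    while frontier:
        current = frontier.pop()
        for cls, deps in DEPENDS_ON.items():
            if current in deps and cls not in affected:
                affected.add(cls)
                frontier.append(cls)
    return affected
```

For example, `affected_by("UnifiedDocument")` returns both exporters plus `OCRService` and `TasksRouter`, which matches the guidance that structural changes to `UnifiedDocument` ripple through export, PDF generation, and the download endpoints.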
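
## Impact Assessment Guidelines
- **Changing Direct / detection logic**: changes `processing_track` and the result format; frontend display and JSON/Markdown/PDF downloads still depend on UnifiedDocument, so export and PDF generation must be verified.
- **Changing OCR / PP-Structure parameters**: affects only the OCR track; the Direct track is unaffected by `pp_structure_params` (per spec), and `processing_track` must still be populated.
- **Changing UnifiedDocument structure / statistics**: requires syncing `UnifiedDocumentExporter`, `pdf_generator_service`, and the frontend `getProcessingMetadata` / download endpoints.
- **Changing resource control**: adjusting the service pool or MemoryGuard affects the execution timing and stability of both Direct and OCR; acquire/release and semaphore behavior must not be broken.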