egg/OCR - OCR

egg/OCR

Author	SHA1	Message	Date
egg	6e050eb540	fix: OCR track table data format and image cropping Table data format fixes (ocr_to_unified_converter.py): - Fix ElementType string conversion using value-based lookup - Add content-based HTML table detection (reclassify TEXT to TABLE) - Use BeautifulSoup for robust HTML table parsing - Generate TableData with fully populated cells arrays Image cropping for OCR track (pp_structure_enhanced.py): - Add _crop_and_save_image method for extracting image regions - Pass source_image_path to _process_parsing_res_list - Return relative filename (not full path) for saved_path - Consistent with Direct Track image saving pattern Also includes: - Add beautifulsoup4 to requirements.txt - Add architecture overview documentation - Archive fix-ocr-track-table-data-format proposal (22/24 tasks) Known issues: OCR track images are restored but still have quality issues that will be addressed in a follow-up proposal. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 18:48:15 +08:00
egg	2d50c128f7	feat: implement core dual-track processing infrastructure Added foundation for dual-track document processing: 1. UnifiedDocument Model (backend/app/models/unified_document.py) - Common output format for both OCR and direct extraction - Comprehensive element types (23+ types from PP-StructureV3) - BoundingBox, StyleInfo, TableData structures - Backward compatibility with legacy format 2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py) - Intelligent document type detection using python-magic - PDF editability analysis using PyMuPDF - Processing track recommendation with confidence scores - Support for PDF, images, Office docs, and text files 3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py) - Fast extraction from editable PDFs using PyMuPDF - Preserves fonts, colors, and exact positioning - Native and positional table detection - Image extraction with coordinates - Hyperlink and metadata extraction 4. Dependencies - Added PyMuPDF>=1.23.0 for PDF extraction - Added pdfplumber>=0.10.0 as fallback - Added python-magic-bin>=0.4.14 for file detection Next: Integrate with OCR service for complete dual-track processing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-18 20:17:50 +08:00
egg	dc31121555	fix: correct OCR coordinate scaling by inferring dimensions from bbox Critical Fix: The previous implementation incorrectly calculated scale factors because calculate_page_dimensions() was prioritizing source file dimensions over OCR coordinate analysis, resulting in scale=1.0 when it should have been ~0.27. Root Cause: - PaddleOCR processes PDFs at high resolution (e.g., 2185x3500 pixels) - OCR bbox coordinates are in this high-res space - calculate_page_dimensions() was returning source PDF size (595x842) instead - This caused scale_w=1.0, scale_h=1.0, placing all text out of bounds Solution: 1. Rewrite calculate_page_dimensions() to: - Accept full ocr_data instead of just text_regions - Process both text_regions AND layout elements - Handle polygon bbox format [[x,y], ...] correctly - Infer OCR dimensions from max bbox coordinates FIRST - Only fallback to source file dimensions if inference fails 2. Separate OCR dimensions from target PDF dimensions: - ocr_width/height: Inferred from bbox (e.g., 2185x3280) - target_width/height: From source file (e.g., 595x842) - scale_w = target_width / ocr_width (e.g., 0.272) - scale_h = target_height / ocr_height (e.g., 0.257) 3. Add PyPDF2 support: - Extract dimensions from source PDF files - Required for getting target PDF size Changes: - backend/app/services/pdf_generator_service.py: - Fix calculate_page_dimensions() to infer from bbox first - Add PyPDF2 support in get_original_page_size() - Simplify scaling logic (removed ocr_dimensions dependency) - Update all drawing calls to use target_height instead of page_height - requirements.txt: - Add PyPDF2>=3.0.0 for PDF dimension extraction - backend/test_bbox_scaling.py: - Add comprehensive test for high-res OCR → A4 PDF scenario - Validates proper scale factor calculation (0.272 x 0.257) Test Results: ✓ OCR dimensions correctly inferred: 2185.0 x 3280.0 ✓ Target PDF dimensions extracted: 595.3 x 841.9 ✓ Scale factors correct: X=0.272, Y=0.257 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 21:01:38 +08:00
egg	fa1abcd8e6	feat: implement layout-preserving PDF generation with table reconstruction Major Features: - Add PDF generation service with Chinese font support - Parse HTML tables from PP-StructureV3 and rebuild with ReportLab - Extract table text for translation purposes - Auto-filter text regions inside tables to avoid overlaps Backend Changes: 1. pdf_generator_service.py (NEW) - HTMLTableParser: Parse HTML tables to extract structure - PDFGeneratorService: Generate layout-preserving PDFs - Coordinate transformation: OCR (top-left) → PDF (bottom-left) - Font size heuristics: 75% of bbox height with width checking - Table reconstruction: Parse HTML → ReportLab Table - Image embedding: Extract bbox from filenames 2. ocr_service.py - Add _extract_table_text() for translation support - Add output_dir parameter to save images to result directory - Extract bbox from image filenames (img_in_table_box_x1_y1_x2_y2.jpg) 3. tasks.py - Update process_task_ocr to use save_results() with PDF generation - Fix download_pdf endpoint to use database-stored PDF paths - Support on-demand PDF generation from JSON 4. config.py - Add chinese_font_path configuration - Add pdf_enable_bbox_debug flag Frontend Changes: 1. PDFViewer.tsx (NEW) - React PDF viewer with zoom and pagination - Memoized file config to prevent unnecessary reloads 2. TaskDetailPage.tsx & ResultsPage.tsx - Integrate PDF preview and download 3. main.tsx - Configure PDF.js worker via CDN 4. vite.config.ts - Add host: '0.0.0.0' for network access - Use VITE_API_URL environment variable for backend proxy Dependencies: - reportlab: PDF generation library - Noto Sans SC font: Chinese character support 🤖 Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 20:21:56 +08:00
egg	36944117f4	fix: update setup script to install PaddlePaddle GPU version from official source Changes to setup_dev_env.sh: - Add support for CUDA 13.x (install CUDA 12.x compatible version) - Use official PaddlePaddle source for GPU versions - Install paddlepaddle-gpu==3.0.0b2 from official index - CUDA 13.x: use cu123 package (backward compatible) - CUDA 12.x: use cu123 package - CUDA 11.7+: use cu118 package - CUDA 11.2-11.6: use cu117 package Changes to requirements.txt: - Comment out paddlepaddle dependency - Let setup script handle GPU/CPU version installation This fixes the issue where pip installed CPU-only paddlepaddle 3.2.1 instead of GPU version, causing GPU acceleration to be unavailable. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 09:35:12 +08:00
beabigegg	0f81d5e70b	feat: Docker化部署 - 單容器架構轉換將 Tool_OCR 從 macOS conda 環境轉換為 Docker 單容器部署方案。前後端整合於同一容器，通過 Nginx 反向代理，僅對外暴露單一端口。 ## 新增功能 - Docker 單容器架構（Frontend + Backend + Nginx） - 多階段構建優化鏡像大小 - Supervisor 進程管理 - 健康檢查機制 - 完整部署文檔 ## 技術細節 - 對外端口：12015（原 12010 已被佔用） - 內部架構：Nginx(12015) → FastAPI(8000) - 前端靜態文件由 Nginx 直接服務 - API 請求通過 Nginx 反向代理 ## 系統依賴完善 - libmagic1：文件類型檢測 - LibreOffice：Office 文檔轉換 - paddlex[ocr]：PP-StructureV3 版面分析 - 中日韓字體支援 ## 配置調整 - 環境變數路徑：macOS 路徑 → 容器絕對路徑 - 前端 API URL：修正為統一端口 12015 - Pip 安裝：延長超時至 600 秒，重試 5 次 - CRLF 轉換：自動處理 Windows 換行符 ## 清理 - 移除臨時文檔（API_FIX_SUMMARY.md 等 7 個文檔） 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-13 13:12:59 +08:00
beabigegg	da700721fa	first	2025-11-12 22:53:17 +08:00

7 Commits