Table data format fixes (ocr_to_unified_converter.py):
- Fix ElementType string conversion using value-based lookup
- Add content-based HTML table detection (reclassify TEXT to TABLE)
- Use BeautifulSoup for robust HTML table parsing
- Generate TableData with fully populated cells arrays
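A minimal sketch of the first two fixes above, assuming a cut-down ElementType enum (the real one has many more members) and two hypothetical helpers, to_element_type() and reclassify(). It uses a simple regular expression for detection; the BeautifulSoup parsing step is not shown.

```python
import re
from enum import Enum


class ElementType(Enum):
    # Hypothetical subset; the project's enum defines many more types.
    TEXT = "text"
    TABLE = "table"
    IMAGE = "image"


def to_element_type(raw: str) -> ElementType:
    """Value-based lookup: ElementType("table"), not ElementType["TABLE"]."""
    try:
        return ElementType(raw.strip().lower())
    except ValueError:
        # Assumed fallback: unknown labels are treated as plain text.
        return ElementType.TEXT


# Matches a complete <table ...>...</table> span anywhere in the content.
_TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def reclassify(element_type: ElementType, content: str) -> ElementType:
    """Promote a TEXT element to TABLE when its content holds an HTML table."""
    if element_type is ElementType.TEXT and _TABLE_RE.search(content):
        return ElementType.TABLE
    return element_type
```

Only TEXT elements are reclassified here, so an element already labelled IMAGE keeps its type even if its content happens to contain table markup.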
Image cropping for OCR track (pp_structure_enhanced.py):
- Add _crop_and_save_image method for extracting image regions
- Pass source_image_path to _process_parsing_res_list
- Return relative filename (not full path) for saved_path
- Consistent with Direct Track image saving pattern
Also includes:
- Add beautifulsoup4 to requirements.txt
- Add architecture overview documentation
- Archive fix-ocr-track-table-data-format proposal (22/24 tasks)
Known issues: OCR track images are restored but still have quality issues
that will be addressed in a follow-up proposal.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Added foundation for dual-track document processing:
1. UnifiedDocument Model (backend/app/models/unified_document.py)
- Common output format for both OCR and direct extraction
- Comprehensive element types (23+ types from PP-StructureV3)
- BoundingBox, StyleInfo, TableData structures
- Backward compatibility with legacy format
2. DocumentTypeDetector Service (backend/app/services/document_type_detector.py)
- Intelligent document type detection using python-magic
- PDF editability analysis using PyMuPDF
- Processing track recommendation with confidence scores
- Support for PDF, images, Office docs, and text files
3. DirectExtractionEngine Service (backend/app/services/direct_extraction_engine.py)
- Fast extraction from editable PDFs using PyMuPDF
- Preserves fonts, colors, and exact positioning
- Native and positional table detection
- Image extraction with coordinates
- Hyperlink and metadata extraction
4. Dependencies
- Added PyMuPDF>=1.23.0 for PDF extraction
- Added pdfplumber>=0.10.0 as fallback
- Added python-magic-bin>=0.4.14 for file detection
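The processing track recommendation from item 2 could look roughly like the sketch below. Everything in it is an assumption for illustration: the TrackRecommendation dataclass, the recommend_track() helper, the 50-character threshold, and the idea of deriving the confidence from the share of pages with extractable text. The per-page character counts would come from a PDF library such as PyMuPDF; here they are passed in directly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TrackRecommendation:
    track: str         # "direct" or "ocr" (hypothetical labels)
    confidence: float  # 0.0 to 1.0


def recommend_track(chars_per_page: Sequence[int], min_chars: int = 50) -> TrackRecommendation:
    """Recommend a track from the number of extractable characters on each page."""
    if not chars_per_page:
        # Nothing to analyse: assume OCR with a neutral confidence.
        return TrackRecommendation("ocr", 0.5)
    text_pages = sum(1 for n in chars_per_page if n >= min_chars)
    ratio = text_pages / len(chars_per_page)
    if ratio >= 0.5:
        return TrackRecommendation("direct", ratio)
    return TrackRecommendation("ocr", 1.0 - ratio)


if __name__ == "__main__":
    # Three of four pages carry a text layer, so direct extraction is suggested.
    print(recommend_track([1200, 900, 0, 1100]))
```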
Next: Integrate with OCR service for complete dual-track processing
Co-Authored-By: Claude <noreply@anthropic.com>
Critical Fix:
The previous implementation incorrectly calculated scale factors because
calculate_page_dimensions() was prioritizing source file dimensions over
OCR coordinate analysis, resulting in scale=1.0 when it should have been ~0.27.
Root Cause:
- PaddleOCR processes PDFs at high resolution (e.g., 2185x3500 pixels)
- OCR bbox coordinates are in this high-res space
- calculate_page_dimensions() was returning source PDF size (595x842) instead
- This caused scale_w=1.0, scale_h=1.0, placing all text out of bounds
Solution:
1. Rewrite calculate_page_dimensions() to:
- Accept full ocr_data instead of just text_regions
- Process both text_regions AND layout elements
- Handle polygon bbox format [[x,y], ...] correctly
- Infer OCR dimensions from max bbox coordinates FIRST
- Only fall back to source file dimensions if inference fails
2. Separate OCR dimensions from target PDF dimensions:
- ocr_width/height: Inferred from bbox (e.g., 2185x3280)
- target_width/height: From source file (e.g., 595x842)
- scale_w = target_width / ocr_width (e.g., 0.272)
- scale_h = target_height / ocr_height (e.g., 0.257)
3. Add PyPDF2 support:
- Extract dimensions from source PDF files
- Required for getting target PDF size
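The inference and scaling steps above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the helper names and the "layout_elements" key are hypothetical ("text_regions" is the only key named in this message), and the target size is passed in rather than read with PyPDF2.

```python
from typing import Optional, Sequence, Tuple


def bbox_extent(bbox: Sequence) -> Tuple[float, float]:
    """Return (max_x, max_y) for a polygon [[x, y], ...] or a flat [x1, y1, x2, y2]."""
    if isinstance(bbox[0], (list, tuple)):
        xs = [float(p[0]) for p in bbox]
        ys = [float(p[1]) for p in bbox]
    else:
        xs = [float(bbox[0]), float(bbox[2])]
        ys = [float(bbox[1]), float(bbox[3])]
    return max(xs), max(ys)


def infer_ocr_dimensions(ocr_data: dict) -> Optional[Tuple[float, float]]:
    """Infer the OCR coordinate space from the largest bbox coordinates."""
    # "layout_elements" is an assumed key name for the layout results.
    regions = list(ocr_data.get("text_regions", [])) + list(ocr_data.get("layout_elements", []))
    extents = [bbox_extent(r["bbox"]) for r in regions if r.get("bbox")]
    if not extents:
        return None
    return max(e[0] for e in extents), max(e[1] for e in extents)


def scale_factors(ocr_data: dict, target_size: Tuple[float, float]) -> Tuple[float, float]:
    """Return (scale_w, scale_h) = target size / OCR size."""
    # Fall back to the source file dimensions only when inference fails.
    ocr_w, ocr_h = infer_ocr_dimensions(ocr_data) or target_size
    return target_size[0] / ocr_w, target_size[1] / ocr_h


if __name__ == "__main__":
    ocr_data = {
        "text_regions": [{"bbox": [[100, 50], [2185, 50], [2185, 120], [100, 120]]}],
        "layout_elements": [{"bbox": [80, 3000, 2000, 3280]}],
    }
    scale_w, scale_h = scale_factors(ocr_data, (595.3, 841.9))
    print(round(scale_w, 3), round(scale_h, 3))
```

With the figures quoted in this message (OCR space 2185 x 3280, target 595.3 x 841.9) the sketch yields roughly 0.272 and 0.257. With no bbox data at all it falls back to a scale of 1.0 on both axes.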
Changes:
- backend/app/services/pdf_generator_service.py:
  - Fix calculate_page_dimensions() to infer from bbox first
  - Add PyPDF2 support in get_original_page_size()
  - Simplify scaling logic (remove ocr_dimensions dependency)
  - Update all drawing calls to use target_height instead of page_height
- requirements.txt:
  - Add PyPDF2>=3.0.0 for PDF dimension extraction
- backend/test_bbox_scaling.py:
  - Add comprehensive test for high-res OCR → A4 PDF scenario
  - Validate proper scale factor calculation (0.272 x 0.257)
Test Results:
✓ OCR dimensions correctly inferred: 2185.0 x 3280.0
✓ Target PDF dimensions extracted: 595.3 x 841.9
✓ Scale factors correct: X=0.272, Y=0.257
Co-Authored-By: Claude <noreply@anthropic.com>
Major Features:
- Add PDF generation service with Chinese font support
- Parse HTML tables from PP-StructureV3 and rebuild with ReportLab
- Extract table text for translation purposes
- Auto-filter text regions inside tables to avoid overlaps
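One way to picture the auto-filtering of text regions inside tables is an overlap test between bounding boxes. The sketch below is an assumption about how such a filter could work: the helper names, the flat [x1, y1, x2, y2] box format, and the 0.5 overlap threshold are all hypothetical.

```python
from typing import Dict, List, Sequence

Box = Sequence[float]  # [x1, y1, x2, y2] in OCR coordinates


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Fraction of inner's area that lies inside outer."""
    w = min(inner[2], outer[2]) - max(inner[0], outer[0])
    h = min(inner[3], outer[3]) - max(inner[1], outer[1])
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def filter_text_outside_tables(
    text_regions: List[Dict], table_boxes: List[Box], threshold: float = 0.5
) -> List[Dict]:
    """Keep only text regions that do not mostly overlap any table box."""
    return [
        region
        for region in text_regions
        if all(overlap_ratio(region["bbox"], table) < threshold for table in table_boxes)
    ]


if __name__ == "__main__":
    regions = [
        {"text": "heading", "bbox": [10, 10, 50, 20]},
        {"text": "cell", "bbox": [110, 110, 150, 120]},
    ]
    # Only "heading" survives; "cell" sits inside the table box.
    print(filter_text_outside_tables(regions, [[100, 100, 300, 300]]))
```

Using an area ratio rather than strict containment means a text box that spills slightly over a table border is still treated as table content.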
Backend Changes:
1. pdf_generator_service.py (NEW)
- HTMLTableParser: Parse HTML tables to extract structure
- PDFGeneratorService: Generate layout-preserving PDFs
- Coordinate transformation: OCR (top-left) → PDF (bottom-left)
- Font size heuristics: 75% of bbox height with width checking
- Table reconstruction: Parse HTML → ReportLab Table
- Image embedding: Extract bbox from filenames
2. ocr_service.py
- Add _extract_table_text() for translation support
- Add output_dir parameter to save images to result directory
- Extract bbox from image filenames (img_in_table_box_x1_y1_x2_y2.jpg)
3. tasks.py
- Update process_task_ocr to use save_results() with PDF generation
- Fix download_pdf endpoint to use database-stored PDF paths
- Support on-demand PDF generation from JSON
4. config.py
- Add chinese_font_path configuration
- Add pdf_enable_bbox_debug flag
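The filename-based bbox extraction and the top-left to bottom-left coordinate transformation listed above might be combined as in this sketch. The helper names and the regular expression are assumptions. The filename pattern follows the img_in_table_box_x1_y1_x2_y2.jpg example, and the returned rectangle is (x, y, width, height) with y measured from the bottom of the page, as PDF drawing libraries such as ReportLab expect.

```python
import re
from typing import Optional, Sequence, Tuple

# Four trailing integers before the extension: ..._x1_y1_x2_y2.jpg
_BBOX_RE = re.compile(r"_(\d+)_(\d+)_(\d+)_(\d+)\.[A-Za-z0-9]+$")


def bbox_from_filename(name: str) -> Optional[Tuple[int, ...]]:
    """Extract (x1, y1, x2, y2) from an image filename, or None if absent."""
    match = _BBOX_RE.search(name)
    return tuple(int(g) for g in match.groups()) if match else None


def to_pdf_rect(
    bbox: Sequence[float], page_height: float, scale_w: float = 1.0, scale_h: float = 1.0
) -> Tuple[float, float, float, float]:
    """Convert an OCR bbox (origin top-left, y down) to a PDF rect (origin bottom-left, y up)."""
    x1, y1, x2, y2 = bbox
    x = x1 * scale_w
    width = (x2 - x1) * scale_w
    height = (y2 - y1) * scale_h
    # The box's bottom edge (y2 in OCR space) becomes its y position in PDF space.
    y = page_height - y2 * scale_h
    return x, y, width, height


if __name__ == "__main__":
    bbox = bbox_from_filename("img_in_table_box_100_200_300_400.jpg")
    print(bbox, to_pdf_rect(bbox, page_height=842))
```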
Frontend Changes:
1. PDFViewer.tsx (NEW)
- React PDF viewer with zoom and pagination
- Memoized file config to prevent unnecessary reloads
2. TaskDetailPage.tsx & ResultsPage.tsx
- Integrate PDF preview and download
3. main.tsx
- Configure PDF.js worker via CDN
4. vite.config.ts
- Add host: '0.0.0.0' for network access
- Use VITE_API_URL environment variable for backend proxy
Dependencies:
- reportlab: PDF generation library
- Noto Sans SC font: Chinese character support
Co-Authored-By: Claude <noreply@anthropic.com>
Changes to setup_dev_env.sh:
- Add support for CUDA 13.x (install CUDA 12.x compatible version)
- Use official PaddlePaddle source for GPU versions
- Install paddlepaddle-gpu==3.0.0b2 from official index
  - CUDA 13.x: use cu123 package (backward compatible)
  - CUDA 12.x: use cu123 package
  - CUDA 11.7+: use cu118 package
  - CUDA 11.2-11.6: use cu117 package
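The version-to-package mapping above, written out as a Python sketch for clarity. The setup script itself is a shell script, so this is only an illustration of the selection logic. The function name and the choice to return None (meaning "use the CPU build") for missing or older CUDA versions are assumptions.

```python
from typing import Optional


def paddle_cuda_tag(cuda_version: Optional[str]) -> Optional[str]:
    """Map a detected CUDA version string such as "12.4" to a package tag."""
    if not cuda_version:
        return None  # assumed: no CUDA detected, install the CPU build
    parts = cuda_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    if major >= 12:
        return "cu123"  # CUDA 12.x, and 13.x via backward compatibility
    if (major, minor) >= (11, 7):
        return "cu118"
    if (major, minor) >= (11, 2):
        return "cu117"
    return None  # assumed: anything older falls back to the CPU build


if __name__ == "__main__":
    for version in ("13.0", "12.4", "11.8", "11.4", None):
        print(version, "->", paddle_cuda_tag(version))
```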
Changes to requirements.txt:
- Comment out paddlepaddle dependency
- Let setup script handle GPU/CPU version installation
This fixes the issue where pip installed the CPU-only paddlepaddle 3.2.1
instead of the GPU version, leaving GPU acceleration unavailable.
Co-Authored-By: Claude <noreply@anthropic.com>