Critical Fix: The previous implementation incorrectly calculated scale factors
because calculate_page_dimensions() was prioritizing source file dimensions
over OCR coordinate analysis, resulting in scale=1.0 when it should have been
~0.27.

Root Cause:
- PaddleOCR processes PDFs at high resolution (e.g., 2185x3500 pixels)
- OCR bbox coordinates are in this high-res space
- calculate_page_dimensions() was returning source PDF size (595x842) instead
- This caused scale_w=1.0, scale_h=1.0, placing all text out of bounds

Solution:
1. Rewrite calculate_page_dimensions() to:
   - Accept full ocr_data instead of just text_regions
   - Process both text_regions AND layout elements
   - Handle polygon bbox format [[x,y], ...] correctly
   - Infer OCR dimensions from max bbox coordinates FIRST
   - Only fallback to source file dimensions if inference fails

2. Separate OCR dimensions from target PDF dimensions:
   - ocr_width/height: Inferred from bbox (e.g., 2185x3280)
   - target_width/height: From source file (e.g., 595x842)
   - scale_w = target_width / ocr_width (e.g., 0.272)
   - scale_h = target_height / ocr_height (e.g., 0.257)

3. Add PyPDF2 support:
   - Extract dimensions from source PDF files
   - Required for getting target PDF size

Changes:
- backend/app/services/pdf_generator_service.py:
  - Fix calculate_page_dimensions() to infer from bbox first
  - Add PyPDF2 support in get_original_page_size()
  - Simplify scaling logic (removed ocr_dimensions dependency)
  - Update all drawing calls to use target_height instead of page_height
- requirements.txt:
  - Add PyPDF2>=3.0.0 for PDF dimension extraction
- backend/test_bbox_scaling.py:
  - Add comprehensive test for high-res OCR → A4 PDF scenario
  - Validates proper scale factor calculation (0.272 x 0.257)

Test Results:
✓ OCR dimensions correctly inferred: 2185.0 x 3280.0
✓ Target PDF dimensions extracted: 595.3 x 841.9
✓ Scale factors correct: X=0.272, Y=0.257

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
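
The scaling approach described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from pdf_generator_service.py: the helper names (bbox_extent, infer_ocr_dimensions, scale_factors) and the "layout_elements" key are hypothetical, and the only assumption about the data is that each region carries a "bbox" in either polygon form [[x,y], ...] or flat form [x1, y1, x2, y2].

```python
from typing import Optional, Tuple


def bbox_extent(bbox) -> Optional[Tuple[float, float]]:
    """Return (max_x, max_y) of a bbox, or None if the shape is unrecognized.

    Accepts polygon form [[x, y], ...] or flat form [x1, y1, x2, y2].
    """
    if not bbox:
        return None
    if isinstance(bbox[0], (list, tuple)):
        # Polygon: take the largest x and y over all corner points.
        xs = [float(p[0]) for p in bbox]
        ys = [float(p[1]) for p in bbox]
    elif len(bbox) >= 4:
        # Flat rectangle: x values at indices 0 and 2, y values at 1 and 3.
        xs = [float(bbox[0]), float(bbox[2])]
        ys = [float(bbox[1]), float(bbox[3])]
    else:
        return None
    return max(xs), max(ys)


def infer_ocr_dimensions(ocr_data: dict) -> Optional[Tuple[float, float]]:
    """Infer the OCR coordinate space from the largest bbox coordinates.

    Returns None when nothing usable is found, so the caller can fall back
    to the source file dimensions.
    """
    max_x = max_y = 0.0
    # "layout_elements" is a hypothetical key name used for illustration.
    for key in ("text_regions", "layout_elements"):
        for item in ocr_data.get(key) or []:
            extent = bbox_extent(item.get("bbox"))
            if extent is None:
                continue
            max_x = max(max_x, extent[0])
            max_y = max(max_y, extent[1])
    if max_x <= 0 or max_y <= 0:
        return None
    return max_x, max_y


def scale_factors(ocr_size: Tuple[float, float],
                  target_size: Tuple[float, float]) -> Tuple[float, float]:
    """Compute (scale_w, scale_h) mapping OCR pixels to target PDF points."""
    ocr_w, ocr_h = ocr_size
    target_w, target_h = target_size
    return target_w / ocr_w, target_h / ocr_h
```

With the figures quoted in the test results (OCR space 2185 x 3280, target page 595.3 x 841.9), scale_factors returns values that round to 0.272 and 0.257. Note that the maximum bbox coordinate is only a lower bound on the true OCR image size, so this inference works best when some region extends close to the page edge.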
67 lines
1.6 KiB
Plaintext
# Tool_OCR - Backend Dependencies
# Python 3.10+

# ===== Core Framework =====
fastapi==0.115.0
uvicorn[standard]==0.32.0
pydantic==2.9.2
pydantic-settings==2.6.1
email-validator>=2.0.0 # For pydantic EmailStr validation

# ===== OCR Engine =====
paddleocr>=3.0.0
# paddlepaddle>=3.0.0 # Installed separately in setup script (GPU/CPU version)
paddlex[ocr]>=3.0.0 # Required for PP-StructureV3 layout analysis

# ===== Image Processing =====
pillow>=10.0.0
pdf2image>=1.17.0
opencv-python>=4.8.0

# ===== PDF Generation =====
weasyprint>=60.0
markdown>=3.5.0
reportlab>=4.0.0 # Layout-preserving PDF generation with precise coordinate control
PyPDF2>=3.0.0 # Extract dimensions from source PDF files
# Note: pandoc needs to be installed via brew (brew install pandoc)

# ===== Data Export =====
pandas>=2.1.0
openpyxl>=3.1.0 # Excel support

# ===== Database =====
sqlalchemy>=2.0.0
pymysql>=1.1.0
alembic>=1.13.0

# ===== Authentication =====
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
bcrypt==4.2.1 # Pin to 4.2.1 for passlib compatibility
python-multipart>=0.0.6

# ===== Configuration =====
python-dotenv>=1.0.0
pyyaml>=6.0

# ===== HTTP Client =====
httpx>=0.25.0
requests>=2.31.0

# ===== Background Tasks (Optional) =====
# redis>=5.0.0 # Uncomment if using Redis for task queue
# celery>=5.3.0 # Uncomment if using Celery

# ===== Translation (Reserved) =====
# argostranslate>=1.9.0 # Uncomment when implementing translation

# ===== Development Tools =====
pytest>=7.4.0
pytest-asyncio>=0.21.0
pytest-cov>=4.1.0
black>=23.9.0
pylint>=3.0.0

# ===== Utilities =====
python-magic>=0.4.27 # File type detection