OCR/requirements.txt
egg dc31121555 fix: correct OCR coordinate scaling by inferring dimensions from bbox
Critical Fix:
The previous implementation incorrectly calculated scale factors because
calculate_page_dimensions() was prioritizing source file dimensions over
OCR coordinate analysis, resulting in scale=1.0 when it should have been ~0.27.

Root Cause:
- PaddleOCR processes PDFs at high resolution (e.g., 2185x3500 pixels)
- OCR bbox coordinates are in this high-res space
- calculate_page_dimensions() was returning the source PDF size (595x842) instead of the OCR resolution
- This caused scale_w=1.0, scale_h=1.0, so unscaled high-res coordinates placed all text out of bounds

Solution:
1. Rewrite calculate_page_dimensions() to:
   - Accept full ocr_data instead of just text_regions
   - Process both text_regions AND layout elements
   - Handle polygon bbox format [[x,y], ...] correctly
   - Infer OCR dimensions from max bbox coordinates FIRST
   - Only fall back to source file dimensions if inference fails

2. Separate OCR dimensions from target PDF dimensions:
   - ocr_width/height: Inferred from bbox (e.g., 2185x3280)
   - target_width/height: From source file (e.g., 595x842)
   - scale_w = target_width / ocr_width (e.g., 0.272)
   - scale_h = target_height / ocr_height (e.g., 0.257)

3. Add PyPDF2 support:
   - Extract dimensions from source PDF files
   - Required for getting target PDF size
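
Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration, not the code in pdf_generator_service.py: the helper names (bbox_extent, infer_ocr_dimensions, compute_scale), the "layout_elements" key, and the flat [x1, y1, x2, y2] bbox variant are assumptions; only "text_regions", the polygon bbox format, and the example numbers come from this commit message.

```python
from typing import Any, Optional, Tuple

Size = Tuple[float, float]


def bbox_extent(bbox: Any) -> Optional[Size]:
    """Return (max_x, max_y) for a polygon [[x, y], ...] or a flat [x1, y1, x2, y2] bbox."""
    if not bbox:
        return None
    if isinstance(bbox[0], (list, tuple)):
        # Polygon format: take the largest x and y over all points.
        return (max(float(p[0]) for p in bbox), max(float(p[1]) for p in bbox))
    if len(bbox) >= 4:
        # Flat format (assumed): [x1, y1, x2, y2].
        return (max(float(bbox[0]), float(bbox[2])),
                max(float(bbox[1]), float(bbox[3])))
    return None


def infer_ocr_dimensions(ocr_data: dict, fallback: Optional[Size] = None) -> Optional[Size]:
    """Infer the OCR coordinate space from the largest bbox coordinates seen."""
    max_x = max_y = 0.0
    # "layout_elements" is a hypothetical key name for the layout results.
    for key in ("text_regions", "layout_elements"):
        for item in ocr_data.get(key) or []:
            extent = bbox_extent(item.get("bbox"))
            if extent is not None:
                max_x = max(max_x, extent[0])
                max_y = max(max_y, extent[1])
    if max_x > 0 and max_y > 0:
        return (max_x, max_y)
    # Inference failed: fall back to the caller-supplied source file dimensions.
    return fallback


def compute_scale(ocr_size: Size, target_size: Size) -> Size:
    """Return (scale_w, scale_h) mapping OCR pixels to target PDF points."""
    ocr_w, ocr_h = ocr_size
    target_w, target_h = target_size
    if ocr_w <= 0 or ocr_h <= 0:
        raise ValueError("OCR dimensions must be positive")
    return (target_w / ocr_w, target_h / ocr_h)
```

Keeping the OCR size and the target size as two separate values is what keeps the scale from collapsing to 1.0: the fallback is used only when no bbox yields a positive extent.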

Changes:
- backend/app/services/pdf_generator_service.py:
  - Fix calculate_page_dimensions() to infer from bbox first
  - Add PyPDF2 support in get_original_page_size()
  - Simplify scaling logic (remove ocr_dimensions dependency)
  - Update all drawing calls to use target_height instead of page_height

- requirements.txt:
  - Add PyPDF2>=3.0.0 for PDF dimension extraction

- backend/test_bbox_scaling.py:
  - Add comprehensive test for high-res OCR → A4 PDF scenario
  - Validate proper scale factor calculation (0.272 x 0.257)
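
The PyPDF2 lookup listed under Changes might look something like the sketch below. The function names page_size_from_reader and get_pdf_page_size are hypothetical and the real get_original_page_size() may be structured differently; the PyPDF2 calls used (PdfReader, reader.pages, page.mediabox.width/height) are the PyPDF2 3.x API. The import is done lazily so the helper degrades to None when PyPDF2 is not installed, which is an assumed design choice.

```python
from typing import Any, Optional, Tuple


def page_size_from_reader(reader: Any, page_index: int = 0) -> Tuple[float, float]:
    """Return (width, height) in PDF points from a PdfReader-like object."""
    box = reader.pages[page_index].mediabox
    # mediabox values may be Decimal-like objects; normalise to float.
    return (float(box.width), float(box.height))


def get_pdf_page_size(path: str, page_index: int = 0) -> Optional[Tuple[float, float]]:
    """Read the page size of a source PDF, or return None if PyPDF2 is unavailable."""
    try:
        from PyPDF2 import PdfReader  # requires PyPDF2>=3.0.0
    except ImportError:
        return None
    with open(path, "rb") as fh:
        return page_size_from_reader(PdfReader(fh), page_index)
```

For an A4 source file this would be expected to return roughly (595.3, 841.9), the target dimensions reported in the test results.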

Test Results:
✓ OCR dimensions correctly inferred: 2185.0 x 3280.0
✓ Target PDF dimensions extracted: 595.3 x 841.9
✓ Scale factors correct: X=0.272, Y=0.257
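
To see why scale=1.0 put text out of bounds, the sketch below maps one OCR pixel onto the target page with the old and the new scale factors. The sample point (2000, 3000) is made up, and the helper names are hypothetical; the y-flip assumes OCR coordinates use a top-left origin while the PDF canvas (as in ReportLab's default) uses a bottom-left origin, which is why target_height appears in the mapping.

```python
def to_pdf_point(x, y, scale_w, scale_h, target_height):
    """Map a top-left-origin OCR pixel to a bottom-left-origin PDF point."""
    return (x * scale_w, target_height - y * scale_h)


def inside_page(point, target_width, target_height):
    """Check whether a PDF point lies within the page rectangle."""
    px, py = point
    return 0 <= px <= target_width and 0 <= py <= target_height


target_w, target_h = 595.3, 841.9
corner = (2000.0, 3000.0)  # made-up OCR pixel in the 2185x3280 space

# Old behaviour: scale factors of 1.0 leave the point far outside the page.
broken = to_pdf_point(corner[0], corner[1], 1.0, 1.0, target_h)
# New behaviour: the factors from the test results keep it on the page.
fixed = to_pdf_point(corner[0], corner[1], 0.272, 0.257, target_h)

print(inside_page(broken, target_w, target_h))
print(inside_page(fixed, target_w, target_h))
```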

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 21:01:38 +08:00


# Tool_OCR - Backend Dependencies
# Python 3.10+
# ===== Core Framework =====
fastapi==0.115.0
uvicorn[standard]==0.32.0
pydantic==2.9.2
pydantic-settings==2.6.1
email-validator>=2.0.0 # For pydantic EmailStr validation
# ===== OCR Engine =====
paddleocr>=3.0.0
# paddlepaddle>=3.0.0 # Installed separately in setup script (GPU/CPU version)
paddlex[ocr]>=3.0.0 # Required for PP-StructureV3 layout analysis
# ===== Image Processing =====
pillow>=10.0.0
pdf2image>=1.17.0
opencv-python>=4.8.0
# ===== PDF Generation =====
weasyprint>=60.0
markdown>=3.5.0
reportlab>=4.0.0 # Layout-preserving PDF generation with precise coordinate control
PyPDF2>=3.0.0 # Extract dimensions from source PDF files
# Note: pandoc needs to be installed via brew (brew install pandoc)
# ===== Data Export =====
pandas>=2.1.0
openpyxl>=3.1.0 # Excel support
# ===== Database =====
sqlalchemy>=2.0.0
pymysql>=1.1.0
alembic>=1.13.0
# ===== Authentication =====
python-jose[cryptography]>=3.3.0
passlib[bcrypt]>=1.7.4
bcrypt==4.2.1 # Pin to 4.2.1 for passlib compatibility
python-multipart>=0.0.6
# ===== Configuration =====
python-dotenv>=1.0.0
pyyaml>=6.0
# ===== HTTP Client =====
httpx>=0.25.0
requests>=2.31.0
# ===== Background Tasks (Optional) =====
# redis>=5.0.0 # Uncomment if using Redis for task queue
# celery>=5.3.0 # Uncomment if using Celery
# ===== Translation (Reserved) =====
# argostranslate>=1.9.0 # Uncomment when implementing translation
# ===== Development Tools =====
pytest>=7.4.0
pytest-asyncio>=0.21.0
pytest-cov>=4.1.0
black>=23.9.0
pylint>=3.0.0
# ===== Utilities =====
python-magic>=0.4.27 # File type detection