Critical Fix for User-Reported Bug:
The function was only checking layout_data.elements but not the 'layout'
field or prioritizing 'text_regions', causing it to miss all bbox data
when layout=[] (empty list) even though text_regions contained valid data.
User's Scenario (ELER-8-100HFV Data Sheet):
- JSON structure: layout=[] (empty), text_regions=[...] (has data)
- Previous code only checked layout_data.elements
- Resulted in max_x=0, max_y=0
- Fell back to source file dimensions (595x842)
- Calculated scale=1.0 instead of ~0.3
- All text with X>595 rendered out of bounds
Root Cause Analysis:
1. Different OCR outputs use different field names
2. Some use 'layout', some use 'text_regions', some use 'layout_data.elements'
3. Previous code didn't check 'layout' field at all
4. Previous code checked layout_data.elements before text_regions
5. If both were empty/missing, fell back to source dims too early
Solution:
Check ALL possible bbox sources in order of priority:
1. text_regions - Most common, contains all text boxes
2. layout - Legacy field, may be empty list
3. layout_data.elements - PP-StructureV3 format
Only fall back to source file dimensions if ALL sources are empty.
Changes:
- backend/app/services/pdf_generator_service.py:
- Rewrite calculate_page_dimensions to check all three fields
- Use explicit extend() to combine all regions
- Add type checks (isinstance) for safety
- Update warning messages to be more specific
- backend/test_empty_layout.py:
- Add test for layout=[] + text_regions=[...] scenario
- Validates scale factors are correct (~0.3, not 1.0)
Test Results:
✓ OCR dimensions inferred from text_regions: 1850.0 x 2880.0
✓ Target PDF dimensions: 595.3 x 841.9
✓ Scale factors correct: X=0.322, Y=0.292 (NOT 1.0!)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>