egg 8333182879 fix: correct Y-axis positioning and implement span-based rendering
CRITICAL BUG FIXES (Based on expert analysis):

Bug A - Y-axis Starting Position Error:
- Previous code used bbox.y1 (bottom) as starting point for multi-line text
- Caused first line to render at last line position, text overflowing downward
- FIX: Span-based rendering now uses `page_height - span.bbox.y1 + (font_size * 0.2)`
  to approximate baseline position for each span individually
- FIX: Block-level fallback starts from bbox.y0 (top), draws lines downward:
  `pdf_y_top = page_height - bbox.y0`, then `line_y = pdf_y_top - ((i + 1) * line_height)`

Bug B - Spans Compressed to First Line:
- Previous code forced all spans to render only on the first line (`if i == 0` check)
- Destroyed multi-line and multi-column layouts by compressing paragraphs
- FIX: Prioritize span-based rendering - each span uses its own precise bbox
- FIX: Removed line iteration for spans - they already have correct coordinates
- FIX: Return immediately after drawing spans to prevent block text overlap
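
A minimal sketch of the control flow described in Bug B, not the actual `_draw_text_element_direct()` code. The `Span` and `TextElement` classes and the `plan_draw_ops` helper are hypothetical stand-ins; the point is only that spans are never gated on a line index, and that the span path returns before the block text is touched:

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Span:
    """Hypothetical stand-in for a child span of a text element."""
    text: str


@dataclass
class TextElement:
    """Hypothetical stand-in for a text element with optional spans."""
    text: str
    children: List[Span] = field(default_factory=list)


def plan_draw_ops(element: TextElement) -> List[Tuple[str, str]]:
    """Return the draw operations for one element as (kind, text) pairs."""
    if element.children:
        # Priority path: one op per span, with no line-index condition,
        # and an early return so the block text is not drawn on top.
        return [("span", s.text) for s in element.children]
    # Fallback path: only reached when the element has no spans.
    return [("line", line) for line in element.text.split("\n")]


with_spans = plan_draw_ops(TextElement("a\nb", [Span("a"), Span("b")]))
without_spans = plan_draw_ops(TextElement("a\nb"))
```

Here `with_spans` contains only span operations and `without_spans` contains only line operations, so the same text is never emitted by both paths.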

Implementation Changes:

1. Span-Based Rendering (Priority Path):
   - Iterate through element.children (spans) with precise bbox from PyMuPDF
   - Each span positioned independently using its own coordinates
   - Apply per-span StyleInfo (font_name, font_size, font_weight, font_style)
   - Transform coordinates: `span_pdf_y = page_height - s_bbox.y1 + (font_size * 0.2)`
   - Used for 84% of text elements (16/19 elements in test)
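
The span coordinate transform can be sketched in isolation as follows. The `BBox` class, the `span_baseline_y` name, and the A4 page height of 842pt are assumptions for illustration; only the formula itself comes from this commit:

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Hypothetical bbox in top-left-origin coordinates (y grows downward)."""
    x0: float
    y0: float  # top edge
    x1: float
    y1: float  # bottom edge


def span_baseline_y(page_height: float, bbox: BBox, font_size: float) -> float:
    """Approximate the baseline of a span in bottom-left-origin PDF space.

    The bbox bottom (y1) is flipped into PDF space, then raised by 20% of
    the font size as a rough allowance for descenders.
    """
    return page_height - bbox.y1 + font_size * 0.2


# Assumed example values: 842pt page, 10pt span whose bbox spans y=100..112.
y = span_baseline_y(842.0, BBox(72.0, 100.0, 300.0, 112.0), 10.0)
```

With these assumed values the bbox bottom maps to 730 in PDF space and the baseline lands 2pt above it.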

2. Block-Level Fallback (Corrected Y-Axis):
   - Used when no spans available (filtered/modified text)
   - Start from TOP: `pdf_y_top = page_height - bbox.y0`
   - Draw lines downward: `line_y = pdf_y_top - ((i + 1) * line_height)`
   - Maintains proper line spacing and paragraph flow
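
A small worked example of the fallback arithmetic, assuming an 842pt page, a block whose top edge is at y0 = 100, and a 12pt line height. The `block_line_positions` helper is hypothetical; the two formulas are the ones listed above:

```python
from typing import List


def block_line_positions(
    page_height: float, bbox_y0: float, line_height: float, n_lines: int
) -> List[float]:
    """Return the PDF-space y position for each line of a block.

    Starts from the top of the bbox (y0) flipped into bottom-left-origin
    PDF space, then steps down one line height per line.
    """
    pdf_y_top = page_height - bbox_y0
    return [pdf_y_top - (i + 1) * line_height for i in range(n_lines)]


# Assumed example values, not taken from the test document.
ys = block_line_positions(842.0, 100.0, 12.0, 3)
```

The positions decrease line by line (730, 718, 706 here), so the first line sits nearest the top of the block and later lines flow downward, which is the opposite of the behaviour described in Bug A.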

3. Testing:
   - Added comprehensive E2E test suite (test_pdf_layout_restoration.py)
   - Quick visual verification test (quick_visual_test.py)
   - Test results documented in TEST_RESULTS_SPAN_FIX.md

Test Results:
- PDF generation: 14,172 bytes, 3 pages with content
- Span rendering: 84% of elements (16/19) using precise bbox
- Font sizes: 10pt as expected (not the 35pt that results from using bbox_height)
- Line count: 152 lines (proper spacing, no compression)
- Reading order: Correct left-to-right, top-to-bottom pattern
- First line: "Technical Data Sheet" (verified correct)

Files Changed:
- backend/app/services/pdf_generator_service.py: Complete rewrite of
  _draw_text_element_direct() method (lines 1796-2024)
- backend/tests/e2e/test_pdf_layout_restoration.py: New E2E test suite
- backend/tests/e2e/TEST_RESULTS_SPAN_FIX.md: Comprehensive test results

References:
- Expert analysis identified Y-axis and span compression bugs
- Solution prioritizes PyMuPDF's precise span-level bbox data
- Maintains backward compatibility with block-level fallback

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 14:57:27 +08:00