OCR/openspec/changes/archive/2025-11-24-pdf-layout-restoration/proposal.md
egg 4325d024a7 chore: cleanup test files and archive pdf-layout-restoration proposal
Remove obsolete test and utility scripts:
- backend/create_test_user.py
- backend/mark_migration_done.py
- backend/fix_alembic_version.py
- backend/RUN_TESTS.md (outdated test documentation)

Archive completed pdf-layout-restoration proposal:
- Moved from openspec/changes/pdf-layout-restoration/
- To openspec/changes/archive/2025-11-24-pdf-layout-restoration/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-24 19:43:05 +08:00

PDF Layout Restoration and Preservation

Problem

PDF generation from both the OCR and Direct extraction tracks currently produces documents that are severely degraded compared to the original. There are four critical issues:

1. Images Never Appear

  • OCR track: pp_structure_enhanced._save_image() is an empty implementation (lines 262, 414), so detected images are never saved
  • Direct track: Image paths are saved as content["saved_path"], but the converter looks for content.get("path"), so the lookup never finds them
  • Result: All PDFs are text-only, with no images whatsoever
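
The Direct-track key mismatch described above could be bridged with a small lookup helper. The following is a minimal sketch, not the project's code: `resolve_image_path`, `IMAGE_PATH_KEYS`, and the `base_dir` parameter are hypothetical names, and only the keys `saved_path` and `path` come from the description above.

```python
from pathlib import Path
from typing import Any, Mapping, Optional

# Keys tried in order: "saved_path" is what the Direct track writes,
# "path" is what the converter currently reads.
IMAGE_PATH_KEYS = ("saved_path", "path")


def resolve_image_path(content: Mapping[str, Any],
                       base_dir: Optional[Path] = None) -> Optional[Path]:
    """Return the first usable image path found in an element's content.

    Hypothetical helper: returns None when no key holds a non-empty string,
    so the caller can skip the image instead of failing.
    """
    for key in IMAGE_PATH_KEYS:
        value = content.get(key)
        if isinstance(value, str) and value:
            path = Path(value)
            # Relative paths are resolved against base_dir, if one is given.
            if base_dir is not None and not path.is_absolute():
                path = base_dir / path
            return path
    return None
```

Accepting both keys keeps older result files readable while the writer and the reader are brought into agreement on a single key.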

2. Tables Never Render

  • Table elements reference placeholder table_*.png images that do not exist on disk
  • draw_table_region() looks up these missing images to obtain bbox coordinates
  • When the image lookup fails, table rendering is skipped entirely
  • Result: No tables appear in generated PDFs
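
Positioning a table from its own bbox, rather than from a placeholder image, mostly comes down to a coordinate conversion. The sketch below assumes the bbox is (x0, y0, x1, y1) with the origin at the top-left of the page and that no scaling is needed. Neither assumption is confirmed by the source, and `Rect` and `table_rect_from_bbox` are hypothetical names.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    x: float       # left edge in PDF points
    y: float       # bottom edge in PDF points (PDF origin is bottom-left)
    width: float
    height: float


def table_rect_from_bbox(bbox: Sequence[float], page_height: float) -> Rect:
    """Convert a top-left-origin bbox (x0, y0, x1, y1) to a PDF rectangle.

    Assumption: the element bbox uses image-style coordinates where y grows
    downward. PDF coordinates grow upward, so the y axis is flipped.
    """
    x0, y0, x1, y1 = bbox
    # Normalize in case the corners arrive in the wrong order.
    left, right = min(x0, x1), max(x0, x1)
    top, bottom = min(y0, y1), max(y0, y1)
    return Rect(x=left, y=page_height - bottom,
                width=right - left, height=bottom - top)
```

For example, on an 842-point-high page, `table_rect_from_bbox((50, 100, 250, 300), 842)` gives a 200 by 200 rectangle whose lower-left corner is at (50, 542).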

3. Text Layout is Broken

  • Each text block is drawn with a single drawString() call, so the entire block renders as one line
  • No line breaks, paragraph alignment, or text styling preserved
  • Direct track extracts StyleInfo but it's completely ignored during PDF generation
  • Result: Text appears as unformatted blocks at wrong positions
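
One way to avoid the single-call problem is to split each block into lines and issue one drawString() per line. The sketch below is illustrative only: `layout_lines` and `draw_text_block` are hypothetical names, and wrapping by character count is a crude stand-in for real font metrics. The `canvas` argument is any object exposing `setFont` and `drawString`. ReportLab's canvas has methods with these names, although the source does not name the PDF library.

```python
import textwrap
from typing import List, Tuple


def layout_lines(text: str, x: float, top_y: float, font_size: float,
                 max_chars: int = 80, leading: float = 1.2
                 ) -> List[Tuple[float, float, str]]:
    """Split a text block into (x, y, line) tuples, one per drawn line.

    Explicit newlines are kept as line breaks. Longer lines are wrapped at
    max_chars characters, which only approximates the true rendered width.
    y is the baseline of each line, moving down the page from top_y.
    """
    placed = []
    y = top_y - font_size  # first baseline sits one font size below the top
    for paragraph in text.split("\n"):
        # textwrap.wrap returns [] for an empty string; keep the blank line.
        for line in textwrap.wrap(paragraph, width=max_chars) or [""]:
            placed.append((x, y, line))
            y -= font_size * leading
    return placed


def draw_text_block(canvas, text: str, x: float, top_y: float,
                    font_name: str = "Helvetica",
                    font_size: float = 10.0) -> None:
    """Draw each line separately on an object with setFont/drawString."""
    canvas.setFont(font_name, font_size)
    for line_x, line_y, line in layout_lines(text, x, top_y, font_size):
        canvas.drawString(line_x, line_y, line)
```

A fuller version would take the font name and size from the extracted StyleInfo instead of the defaults used here, and would measure line widths with the PDF library's font metrics.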

4. Information Loss in Conversion

  • Direct track data gets converted to legacy OCR format, losing rich metadata
  • Span-level information (fonts, colors, styles) is discarded
  • Precise positioning information is reduced to simple bboxes
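
The loss can be made concrete with a small model of the conversion. Everything in this sketch is hypothetical: the `Span` fields and `to_legacy_block` are illustrative stand-ins, not the project's actual StyleInfo or converter. It shows how collapsing spans into one block keeps only the joined text and a union bbox.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Span:
    """Hypothetical span record; field names are illustrative only."""
    text: str
    bbox: BBox
    font: str
    size: float
    color: Tuple[float, float, float]  # RGB components in the range 0..1


def to_legacy_block(spans: List[Span]) -> Dict[str, object]:
    """Lossy conversion: join the text and keep only the union bbox.

    Font, size, color, and per-span positions are all dropped here, which
    is the kind of information loss described above.
    """
    if not spans:
        raise ValueError("at least one span is required")
    return {
        "text": "".join(s.text for s in spans),
        "bbox": (
            min(s.bbox[0] for s in spans),
            min(s.bbox[1] for s in spans),
            max(s.bbox[2] for s in spans),
            max(s.bbox[3] for s in spans),
        ),
    }
```

Once a block has passed through a conversion like this, the renderer cannot recover which words were bold or where each span started, so preserving the span list alongside the block is the safer design.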

Solution

Implement proper layout preservation for PDF generation:

  1. Fix image handling: Actually save images and use correct path references
  2. Fix table rendering: Use element's own bbox instead of looking for fake images
  3. Preserve text formatting: Use StyleInfo and span-level data for accurate rendering
  4. Track-specific rendering: Different approaches for OCR vs Direct tracks
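
Track-specific rendering could be organized as a dispatch table keyed by track. This is a sketch under assumptions: the track names "ocr" and "direct", the element keys "type" and "spans", and all function names are hypothetical. The planners return descriptive strings rather than drawing anything, so the example stays independent of any PDF library.

```python
from typing import Any, Callable, Dict, List, Mapping

Element = Mapping[str, Any]
Planner = Callable[[List[Element]], List[str]]


def plan_ocr(elements: List[Element]) -> List[str]:
    """OCR track: assume only block-level data exists, so plan one op per block."""
    return [f"block:{e.get('type', 'text')}" for e in elements]


def plan_direct(elements: List[Element]) -> List[str]:
    """Direct track: plan one op per span, falling back to the block."""
    ops: List[str] = []
    for e in elements:
        spans = e.get("spans") or []
        if spans:
            ops.extend(f"span:{s['text']}" for s in spans)
        else:
            ops.append(f"block:{e.get('type', 'text')}")
    return ops


# Hypothetical registry; a new track only needs a new entry here.
PLANNERS: Dict[str, Planner] = {"ocr": plan_ocr, "direct": plan_direct}


def plan_render(track: str, elements: List[Element]) -> List[str]:
    """Pick the planner for the given track and run it."""
    try:
        planner = PLANNERS[track]
    except KeyError:
        raise ValueError(f"unknown track: {track!r}") from None
    return planner(elements)
```

Keeping the two paths separate means Direct-track data never has to be downgraded to the OCR format just to reach the renderer.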

Impact

  • User Experience: Output PDFs will actually be usable and readable
  • Functionality: Tables and images will finally appear in outputs
  • Quality: Direct track PDFs will closely match original formatting
  • Performance: No negative impact expected; generation may be faster because unnecessary format conversions are avoided

Tasks

  • Fix image saving and path references (Critical)
  • Fix table rendering using actual bbox data (Critical)
  • Implement track-specific PDF generation (Important)
  • Preserve text styling and formatting (Important)
  • Add span-level text rendering (Nice-to-have)

Deltas

result-export

+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation