egg/OCR

egg 4325d024a7 chore: cleanup test files and archive pdf-layout-restoration proposal

Remove obsolete test and utility scripts:
- backend/create_test_user.py
- backend/mark_migration_done.py
- backend/fix_alembic_version.py
- backend/RUN_TESTS.md (outdated test documentation)

Archive completed pdf-layout-restoration proposal:
- Moved from openspec/changes/pdf-layout-restoration/
- To openspec/changes/archive/2025-11-24-pdf-layout-restoration/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-24 19:43:05 +08:00

2.7 KiB

PDF Layout Restoration and Preservation

Problem

PDF generation from both the OCR and Direct extraction tracks currently produces documents that are severely degraded compared to the original. There are four critical issues:

1. Images Never Appear

OCR track: pp_structure_enhanced._save_image() is an empty implementation (lines 262, 414), so detected images are never saved
Direct track: Image paths are saved as content["saved_path"], but the converter looks for content.get("path"), so the lookup never finds them
Result: All PDFs are text-only, with no images whatsoever
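
The key mismatch on the Direct track could be handled with a tolerant lookup. The following is a minimal sketch, not the project's actual converter code: the helper name `resolve_image_path` and the `base_dir` parameter are hypothetical, and only the two keys `saved_path` and `path` come from the description above.

```python
from pathlib import Path
from typing import Optional

# Keys are tried in order: "saved_path" is what the Direct track writes,
# "path" is what the converter currently reads (both named in this proposal).
IMAGE_PATH_KEYS = ("saved_path", "path")


def resolve_image_path(content: dict, base_dir: Path) -> Optional[Path]:
    """Return the image file for an element, or None if no file exists.

    Hypothetical helper: the name and the base_dir parameter are assumptions.
    """
    for key in IMAGE_PATH_KEYS:
        value = content.get(key)
        if not value:
            continue
        candidate = Path(value)
        # Relative paths are assumed to be relative to the result directory.
        if not candidate.is_absolute():
            candidate = base_dir / candidate
        if candidate.is_file():
            return candidate
    return None
```

Returning None when the file is missing lets the caller skip or log the image explicitly instead of silently producing a text-only page.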

2. Tables Never Render

Table elements carry placeholder table_*.png references that do not exist as actual files
draw_table_region() tries to find these nonexistent images to get bbox coordinates
When the images are not found, table rendering is skipped entirely
Result: No tables appear in generated PDFs
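
Reading the position from the table element itself removes the dependency on image files. The sketch below assumes that each element stores its bbox as `[x0, y0, x1, y1]` with a top-left origin and that the page height is known; the real schema may differ, and `table_rect_from_bbox` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height in PDF points


def table_rect_from_bbox(element: dict, page_height: float) -> Optional[Rect]:
    """Convert an element's own bbox to a PDF rectangle, or None if unusable.

    Assumes element["bbox"] is [x0, y0, x1, y1] with a top-left origin.
    """
    bbox: Optional[Sequence[float]] = element.get("bbox")
    if bbox is None or len(bbox) != 4:
        return None
    x0, y0, x1, y1 = (float(v) for v in bbox)
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        return None
    # PDF user space has its origin at the bottom-left, so flip the y axis.
    return (x0, page_height - y1, width, height)
```

A degenerate or missing bbox yields None, so the caller can report the skipped table rather than dropping it without a trace.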

3. Text Layout is Broken

All text is drawn with a single drawString() call that treats the entire block as one line
No line breaks, paragraph alignment, or text styling are preserved
The Direct track extracts StyleInfo, but it is completely ignored during PDF generation
Result: Text appears as unformatted blocks at wrong positions
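
One way to replace the single-line draw is to split each block into lines and give every line its own baseline. The sketch below is illustrative only: `layout_lines` is a hypothetical helper, and wrapping by character count is a simplifying assumption; a real implementation would measure line width with the font's metrics.

```python
import textwrap
from typing import List, Tuple


def layout_lines(text: str, x: float, top_y: float, font_size: float,
                 max_chars: int, leading_factor: float = 1.2
                 ) -> List[Tuple[float, float, str]]:
    """Split a text block into (x, baseline_y, line) tuples, top to bottom."""
    leading = font_size * leading_factor
    placed: List[Tuple[float, float, str]] = []
    y = top_y - font_size  # first baseline sits one font size below the top
    for paragraph in text.split("\n"):
        # textwrap.wrap returns [] for empty input; keep blank lines as space.
        lines = textwrap.wrap(paragraph, width=max_chars) or [""]
        for line in lines:
            placed.append((x, y, line))
            y -= leading
    return placed
```

Each returned tuple can then be passed to one drawString()-style call, so explicit line breaks in the source text survive into the PDF.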

4. Information Loss in Conversion

Direct track data gets converted to legacy OCR format, losing rich metadata
Span-level information (fonts, colors, styles) is discarded
Precise positioning information is reduced to simple bboxes

Solution

Implement proper layout preservation for PDF generation:

Fix image handling: Actually save images and use correct path references
Fix table rendering: Use each element's own bbox instead of looking for placeholder images that do not exist
Preserve text formatting: Use StyleInfo and span-level data for accurate rendering
Track-specific rendering: Different approaches for OCR vs Direct tracks
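
Track-specific rendering can be sketched as a small dispatch table. Everything in this example is hypothetical: the track names `"ocr"` and `"direct"`, the renderer functions, and the string "draw operations" they return are placeholders standing in for real PDF drawing code.

```python
from typing import Callable, Dict, List


def render_ocr(elements: List[dict]) -> List[str]:
    # OCR track: only text position (bbox) is assumed to be available.
    return [f"text@{e['bbox']}" for e in elements]


def render_direct(elements: List[dict]) -> List[str]:
    # Direct track: style metadata is carried through instead of discarded.
    return [f"text@{e['bbox']} style={e.get('style')}" for e in elements]


RENDERERS: Dict[str, Callable[[List[dict]], List[str]]] = {
    "ocr": render_ocr,
    "direct": render_direct,
}


def render(track: str, elements: List[dict]) -> List[str]:
    """Pick the renderer for a track; unknown tracks fail loudly."""
    try:
        renderer = RENDERERS[track]
    except KeyError:
        raise ValueError(f"unknown track: {track!r}") from None
    return renderer(elements)
```

Keeping the two renderers separate means Direct track data does not have to be converted to the legacy OCR format before drawing.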

Impact

User Experience: Output PDFs become usable and readable
Functionality: Tables and images appear in outputs
Quality: Direct track PDFs closely match the original formatting
Performance: No negative impact expected; possibly faster because unnecessary conversions are avoided

Tasks

Fix image saving and path references (Critical)
Fix table rendering using actual bbox data (Critical)
Implement track-specific PDF generation (Important)
Preserve text styling and formatting (Important)
Add span-level text rendering (Nice-to-have)

Deltas

result-export

+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation
