# PDF Layout Restoration and Preservation

## Problem

PDF generation from both the OCR and Direct extraction tracks currently produces documents that are **severely degraded compared to the original**, with multiple critical issues:

### 1. Images Never Appear

- **OCR track**: `pp_structure_enhanced._save_image()` is an empty implementation (lines 262, 414), so detected images are never saved
- **Direct track**: Image paths are saved as `content["saved_path"]`, but the converter looks for `content.get("path")`, so the lookup never finds them
- **Result**: All PDFs are text-only, with no images whatsoever

### 2. Tables Never Render

- Table elements use fake `table_*.png` references that don't exist as actual files
- `draw_table_region()` tries to find these non-existent images to get bbox coordinates
- When the images aren't found, table rendering is skipped entirely
- **Result**: No tables appear in generated PDFs

### 3. Text Layout is Broken

- All text is drawn with a single `drawString()` call that treats the entire block as one line
- No line breaks, paragraph alignment, or text styling are preserved
- The Direct track extracts `StyleInfo`, but it is completely ignored during PDF generation
- **Result**: Text appears as unformatted blocks at the wrong positions

### 4. Information Loss in Conversion

- Direct track data gets converted to the legacy OCR format, losing rich metadata
- Span-level information (fonts, colors, styles) is discarded
- Precise positioning information is reduced to simple bboxes

## Solution

Implement proper layout preservation for PDF generation:

1. **Fix image handling**: Actually save images and use correct path references
2. **Fix table rendering**: Use the element's own bbox instead of looking for fake images
3. **Preserve text formatting**: Use `StyleInfo` and span-level data for accurate rendering
4. **Track-specific rendering**: Use different approaches for the OCR and Direct tracks

## Impact

- **User Experience**: Output PDFs will actually be usable and readable
- **Functionality**: Tables and images will finally appear in outputs
- **Quality**: Direct track PDFs will closely match the original formatting
- **Performance**: No negative impact; possibly faster by avoiding unnecessary conversions

## Tasks

- Fix image saving and path references (Critical)
- Fix table rendering using actual bbox data (Critical)
- Implement track-specific PDF generation (Important)
- Preserve text styling and formatting (Important)
- Add span-level text rendering (Nice-to-have)

## Deltas

### result-export

```delta
+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation
```
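
The first three fixes listed under Solution could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `resolve_image_path`, `table_rect`, and `line_baselines` are hypothetical, and the bbox layout `(x0, y0, x1, y1)` with a top-left origin is an assumption about how elements store their coordinates. Only the `saved_path` / `path` keys and the per-line `drawString()` idea come from the problem description.

```python
from typing import Any, List, Mapping, Optional, Sequence, Tuple


def resolve_image_path(content: Mapping[str, Any]) -> Optional[str]:
    """Return the stored image path, accepting either key name.

    The Direct track writes "saved_path" while the converter reads "path";
    checking both keys avoids that mismatch.
    """
    for key in ("saved_path", "path"):
        value = content.get(key)
        if value:
            return str(value)
    return None


def table_rect(
    bbox: Sequence[float], page_height: float
) -> Tuple[float, float, float, float]:
    """Map a table element's own bbox to an (x, y, width, height) rectangle.

    Assumption: bbox is (x0, y0, x1, y1) measured from the top-left corner,
    while the PDF page uses a bottom-left origin, so the y axis is flipped.
    """
    x0, y0, x1, y1 = bbox
    return (x0, page_height - y1, x1 - x0, y1 - y0)


def line_baselines(
    text: str, top_y: float, font_size: float, leading: float
) -> List[Tuple[str, float]]:
    """Split a text block on newlines and give each line its own baseline.

    Each (line, y) pair can then be drawn with a separate drawString() call
    instead of drawing the whole block as a single line.
    """
    lines = text.splitlines()
    return [
        (line, top_y - font_size - index * leading)
        for index, line in enumerate(lines)
    ]


if __name__ == "__main__":
    print(resolve_image_path({"saved_path": "images/fig_1.png"}))
    print(table_rect((10, 20, 110, 70), page_height=800))
    print(line_baselines("first line\nsecond line", 700, 10, 12))
```

Keeping these helpers free of any PDF library makes them easy to unit-test; the renderer would call them just before drawing an image, a table, or a text block.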