chore: cleanup test files and archive pdf-layout-restoration proposal

Remove obsolete test and utility scripts:
- backend/create_test_user.py
- backend/mark_migration_done.py
- backend/fix_alembic_version.py
- backend/RUN_TESTS.md (outdated test documentation)

Archive completed pdf-layout-restoration proposal:
- Moved from openspec/changes/pdf-layout-restoration/
- To openspec/changes/archive/2025-11-24-pdf-layout-restoration/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,57 @@
# PDF Layout Restoration and Preservation
## Problem
PDF generation from both the OCR and Direct extraction tracks currently produces documents that are **severely degraded compared to the original**. There are four critical issues:
### 1. Images Never Appear
- **OCR track**: `pp_structure_enhanced._save_image()` is an empty implementation (lines 262, 414), so detected images are never saved
- **Direct track**: Image paths are saved as `content["saved_path"]`, but the converter looks for `content.get("path")`, so the lookup never finds them
- **Result**: All PDFs are text-only, with no images whatsoever
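
The Direct-track key mismatch can be illustrated with a minimal sketch. Only the `saved_path` and `path` keys come from the description above; the helper name `resolve_image_path` and the example file path are hypothetical, and the real converter may structure this differently.

```python
from typing import Optional


def resolve_image_path(content: dict) -> Optional[str]:
    """Return the image path stored under either key, or None.

    Hypothetical helper: accepts both the key the extractor writes
    ("saved_path") and the key the converter reads ("path").
    """
    return content.get("saved_path") or content.get("path")


# Example element content shaped like the Direct track output described above.
# The file path is an illustrative value, not taken from the project.
element_content = {"saved_path": "images/page1_img0.png"}

# Current lookup: the "path" key is absent, so the image is silently skipped.
print(element_content.get("path"))          # None
# Tolerant lookup: finds the path under either key.
print(resolve_image_path(element_content))  # images/page1_img0.png
```

Accepting both keys is one possible fix; writing the path under the key the converter already expects would work equally well.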
### 2. Tables Never Render
- Table elements carry placeholder `table_*.png` references that do not correspond to actual files
- `draw_table_region()` looks up these non-existent images to obtain bbox coordinates
- When the images are not found, table rendering is skipped entirely
- **Result**: No tables appear in generated PDFs
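
As a rough sketch of the alternative, the table region can be read from the element itself rather than from an image lookup. The element schema here (a dict with a four-value `bbox` entry) and the function name `table_bbox` are assumptions made for illustration, not the project's actual data model.

```python
from typing import Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def table_bbox(element: dict) -> Optional[BBox]:
    """Read the table region directly from the element (hypothetical schema)."""
    bbox: Optional[Sequence[float]] = element.get("bbox")
    if bbox is None or len(bbox) != 4:
        return None  # caller can log and skip instead of failing silently
    x0, y0, x1, y1 = (float(v) for v in bbox)
    # Normalise so that x0 <= x1 and y0 <= y1 regardless of input order.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


# The image reference is ignored; only the element's own bbox is used.
table = {"type": "table", "bbox": [72, 500, 300, 420], "image": "table_0.png"}
print(table_bbox(table))  # (72.0, 420.0, 300.0, 500.0)
```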
### 3. Text Layout is Broken
- Each text block is drawn with a single `drawString()` call, so the entire block renders as one line
- Line breaks, paragraph alignment, and text styling are not preserved
- The Direct track extracts `StyleInfo`, but it is ignored during PDF generation
- **Result**: Text appears as unformatted blocks at wrong positions
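
One way to avoid drawing a whole block as a single line is to wrap it first and compute one baseline per line. The sketch below is an assumption-laden illustration: `layout_lines` is a hypothetical name, the character-width estimate is a crude stand-in for real font metrics, and the coordinates assume a bottom-left origin as in a ReportLab-style canvas.

```python
import textwrap
from typing import List, Tuple


def layout_lines(
    text: str,
    x: float,
    top_y: float,
    box_width: float,
    font_size: float = 10.0,
    leading: float = 1.2,
    avg_char_width: float = 0.5,
) -> List[Tuple[float, float, str]]:
    """Split a text block into wrapped lines, each with its own position.

    avg_char_width is a rough fraction of the font size per character
    (an assumption; real code would measure the string with font metrics).
    """
    max_chars = max(1, int(box_width / (font_size * avg_char_width)))
    placed: List[Tuple[float, float, str]] = []
    y = top_y - font_size  # first baseline sits one font size below the top edge
    for paragraph in text.split("\n"):  # keep explicit line breaks
        for line in textwrap.wrap(paragraph, width=max_chars) or [""]:
            placed.append((x, y, line))
            y -= font_size * leading  # move down one line
    return placed


# Each tuple could then be passed to a separate drawString(x, y, line) call.
for x, y, line in layout_lines("alpha beta gamma\ndelta", 10, 100, 50):
    print(x, y, repr(line))
```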
### 4. Information Loss in Conversion
- Direct track data is converted to the legacy OCR format, losing rich metadata
- Span-level information (fonts, colors, styles) is discarded
- Precise positioning information is reduced to simple bboxes
## Solution
Implement proper layout preservation for PDF generation:

1. **Fix image handling**: Actually save images and use correct path references
2. **Fix table rendering**: Use each element's own bbox instead of looking for fake images
3. **Preserve text formatting**: Use `StyleInfo` and span-level data for accurate rendering
4. **Track-specific rendering**: Different approaches for OCR vs Direct tracks
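
Track-specific rendering could be organised as a simple dispatch on the processing track, so that Direct track data is never pushed through the legacy OCR format. This is only a sketch: the `track` and `elements` keys, the renderer names, and the string outputs are all hypothetical placeholders for real drawing code.

```python
from typing import Callable, Dict, List


def render_ocr(elements: List[dict]) -> List[str]:
    """Placeholder OCR renderer: would draw text using simple bboxes."""
    return [f"ocr:{e['type']}" for e in elements]


def render_direct(elements: List[dict]) -> List[str]:
    """Placeholder Direct renderer: would apply StyleInfo and span data."""
    return [f"direct:{e['type']}" for e in elements]


# Map each track name to its renderer (names are illustrative assumptions).
RENDERERS: Dict[str, Callable[[List[dict]], List[str]]] = {
    "ocr": render_ocr,
    "direct": render_direct,
}


def render_document(document: dict) -> List[str]:
    """Pick the renderer that matches the document's processing track."""
    track = document.get("track")
    if track not in RENDERERS:
        raise ValueError(f"unknown processing track: {track!r}")
    return RENDERERS[track](document.get("elements", []))


doc = {"track": "direct", "elements": [{"type": "text"}, {"type": "table"}]}
print(render_document(doc))  # ['direct:text', 'direct:table']
```

Failing loudly on an unknown track is a deliberate choice in this sketch, since a silent fallback to the OCR path would reintroduce the metadata loss described under Problem.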
## Impact
- **User Experience**: Output PDFs will be usable and readable
- **Functionality**: Tables and images will appear in outputs
- **Quality**: Direct track PDFs will closely match original formatting
- **Performance**: No negative impact expected; generation may be faster because unnecessary conversions are avoided
## Tasks
- Fix image saving and path references (Critical)
- Fix table rendering using actual bbox data (Critical)
- Implement track-specific PDF generation (Important)
- Preserve text styling and formatting (Important)
- Add span-level text rendering (Nice-to-have)
## Deltas
### result-export
```delta
+ image_handling: Proper image saving and path resolution
+ table_rendering: Direct bbox usage for table positioning
+ text_formatting: StyleInfo preservation and application
+ track_specific_rendering: OCR vs Direct track differentiation
```