OCR/pdf-layout-restoration at 3358d97624b3e4da162373db9dc68afebfbe0427 - OCR

egg/OCR

egg 6d4df26223 feat: add multi-column layout support for PDF extraction and generation

- Enable PyMuPDF sort=True for correct reading order in multi-column PDFs
- Add column detection utilities (_sort_elements_for_reading_order, _detect_columns)
- Preserve extraction order in PDF generation instead of re-sorting by Y position
- Fix StyleInfo field names (font_name, font_size, text_color instead of font, size, color)
- Fix Page.dimensions access (was incorrectly accessing Page.width directly)
- Implement row-by-row reading order (top-to-bottom, left-to-right within each row)
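
The row-by-row ordering in the last bullet can be sketched roughly as follows. This is an illustration only, not the code from this commit: the `Element` class, the `sort_rows_reading_order` name, and the `y_tolerance` value are hypothetical, and the real `_sort_elements_for_reading_order` helper may work differently.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for an extracted element with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def sort_rows_reading_order(elements, y_tolerance=5.0):
    """Order elements top-to-bottom, then left-to-right within each row.

    Elements whose top edge lies within `y_tolerance` points of the first
    element of the current row are treated as belonging to that row.
    The tolerance value is an assumption chosen for illustration.
    """
    rows = []
    for el in sorted(elements, key=lambda e: (e.y0, e.x0)):
        if rows and abs(el.y0 - rows[-1][0].y0) <= y_tolerance:
            rows[-1].append(el)   # same visual row as the previous element
        else:
            rows.append([el])     # start a new row

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda e: e.x0))  # left to right
    return ordered


if __name__ == "__main__":
    page = [
        Element("D", 300, 199, 500, 220),
        Element("A", 50, 102, 250, 120),
        Element("Title", 50, 10, 500, 40),
        Element("C", 50, 200, 250, 220),
        Element("B", 300, 100.2, 500, 120),
    ]
    print([e.text for e in sort_rows_reading_order(page)])
```

Grouping by a tolerance rather than by exact Y values keeps cells on the same visual row together even when their baselines differ by a point or two.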

This fixes the issue where multi-column PDFs (e.g., technical data sheets) had
incorrect element ordering, with the title appearing at position 12 instead of first.
PyMuPDF's built-in sort=True parameter provides optimal reading order for most
multi-column layouts without requiring custom column detection.
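
As a minimal sketch of what enabling `sort=True` looks like with PyMuPDF's `Page.get_text`: the helper names `blocks_in_reading_order` and `read_pdf` are hypothetical and not taken from this repository, and the block-tuple layout is assumed to follow the PyMuPDF documentation for the `"blocks"` output.

```python
def blocks_in_reading_order(page):
    """Return the text of each text block on a PyMuPDF page.

    With sort=True, PyMuPDF sorts the output by vertical, then horizontal
    position instead of returning blocks in file order.
    """
    blocks = page.get_text("blocks", sort=True)
    # Each block is (x0, y0, x1, y1, text, block_no, block_type);
    # block_type 0 is text, 1 is an image placeholder.
    return [b[4].strip() for b in blocks if b[6] == 0]


def read_pdf(path):
    """Hypothetical convenience wrapper: one list of block texts per page."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no hard dependency

    with fitz.open(path) as doc:
        return [blocks_in_reading_order(page) for page in doc]
```

Because the sort is purely positional, it is worth checking the output on a representative multi-column document before relying on it in place of explicit column detection.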

Resolves: Multi-column layout reading order issue reported by a user
Affects: Direct track PDF extraction and generation (Task 8)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-24 14:25:53 +08:00

| Name | Last commit | Date |
|---|---|---|
| specs/result-export | feat: create PDF layout restoration proposal | 2025-11-20 19:00:49 +08:00 |
| design.md | feat: create PDF layout restoration proposal | 2025-11-20 19:00:49 +08:00 |
| proposal.md | feat: create PDF layout restoration proposal | 2025-11-20 19:00:49 +08:00 |
| tasks.md | feat: add multi-column layout support for PDF extraction and generation | 2025-11-24 14:25:53 +08:00 |