- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
MODIFIED Requirements
Requirement: Enhanced PDF Export with Layout Preservation
The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.
Scenario: OCR Track reflow PDF uses raw OCR regions
- WHEN generating reflow PDF for an OCR Track document
- THEN the system SHALL load text content from `raw_ocr_regions.json` files
- AND text blocks SHALL be sorted by Y coordinate for reading order
- AND all text content SHALL match the Layout PDF output
- AND images and charts SHALL be embedded from element `saved_path`
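
The loading and ordering steps in this scenario can be sketched as follows. This is a minimal illustration, not the project's implementation: the schema of `raw_ocr_regions.json` is not given here, so the sketch assumes a JSON array of objects with a `text` string and a `bbox` of the form `[x0, y0, x1, y1]`. The function names `reading_order_text` and `load_reflow_text` are hypothetical.

```python
import json
from pathlib import Path


def reading_order_text(regions):
    """Return non-empty region texts sorted top-to-bottom by Y coordinate.

    Assumption (not from the spec): each region is a dict with a "text"
    string and a "bbox" list laid out as [x0, y0, x1, y1].
    """
    # sorted() is stable, so regions sharing a Y value keep their input order.
    ordered = sorted(regions, key=lambda region: region["bbox"][1])
    return [r["text"] for r in ordered if r.get("text", "").strip()]


def load_reflow_text(regions_path):
    """Load one raw OCR regions file and return its text in reading order."""
    raw = Path(regions_path).read_text(encoding="utf-8")
    return reading_order_text(json.loads(raw))
```

Sorting on the top edge alone follows the scenario as written. Multi-column pages would need an additional column-detection step, which this sketch does not attempt.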
Scenario: Direct Track reflow PDF uses structured content
- WHEN generating reflow PDF for a Direct Track document
- THEN the system SHALL use `content.cells` for table rendering
- AND text elements SHALL use `content` string directly
- AND images and charts SHALL be embedded from element `saved_path`
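
A rough sketch of how a Direct Track element might be dispatched under this scenario is shown below. The element layout is assumed for illustration only: a `type` field with values such as `table`, `image`, `chart`, or `text`, and table cells shaped as `{"row", "col", "content"}`. Only `content`, `content.cells`, and `saved_path` come from the requirement itself; `element_to_flow_item` is a hypothetical helper.

```python
def element_to_flow_item(element):
    """Map one Direct Track element to a (kind, payload) pair for reflow.

    Assumptions (not from the spec): elements are dicts with a "type" key,
    and each table cell is a dict with zero-based "row" and "col" indices
    plus a "content" string.
    """
    kind = element.get("type")
    content = element.get("content")

    if kind == "table":
        cells = content["cells"]
        if not cells:
            return ("table", [])
        n_rows = max(c["row"] for c in cells) + 1
        n_cols = max(c["col"] for c in cells) + 1
        # Build a dense grid; positions with no cell stay as empty strings.
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            grid[c["row"]][c["col"]] = c["content"]
        return ("table", grid)

    if kind in ("image", "chart"):
        # Images and charts are embedded from the element's saved_path.
        return ("image", element["saved_path"])

    # Text elements use the content string directly.
    return ("text", content)
```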
Scenario: Reflow PDF content consistency
- WHEN comparing Layout PDF and Reflow PDF for the same document
- THEN both PDFs SHALL contain the same text content
- AND only the presentation format SHALL differ (positioned vs flowing)
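
One possible way to check this consistency requirement is to compare the two outputs as multisets of whitespace-normalized text blocks, so that ordering and line wrapping do not count as differences. The sketch below assumes the text blocks have already been extracted from each PDF (extraction is out of scope here), and `same_text_content` is a hypothetical helper name.

```python
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace so wrapping differences are ignored."""
    return " ".join(text.split())


def same_text_content(layout_blocks, reflow_blocks):
    """Return True if both block lists hold the same text, ignoring order."""
    def as_multiset(blocks):
        # Drop blocks that are empty after normalization.
        return Counter(t for t in map(normalize, blocks) if t)

    return as_multiset(layout_blocks) == as_multiset(reflow_blocks)
```

Comparing multisets rather than sets means a block that appears twice in one output and once in the other is still reported as a mismatch.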