fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,23 @@
## MODIFIED Requirements

### Requirement: Enhanced PDF Export with Layout Preservation

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.

#### Scenario: OCR Track reflow PDF uses raw OCR regions
- **WHEN** generating reflow PDF for an OCR Track document
- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
- **AND** text blocks SHALL be sorted by Y coordinate for reading order
- **AND** all text content SHALL match the Layout PDF output
- **AND** images and charts SHALL be embedded from element `saved_path`
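
The loading and ordering steps of this scenario can be sketched as follows. This is a minimal illustration, not the project's implementation: the layout of `raw_ocr_regions.json` is not shown in this change, so the `text` and `bbox` keys, the two supported box shapes, and the helper names `load_regions`, `top_left`, and `reading_order` are all assumptions.

```python
import json
from pathlib import Path


def load_regions(path):
    """Read the list of OCR regions from one raw_ocr_regions JSON file.

    Assumption: the file holds a JSON array of objects with "text" and "bbox".
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def top_left(bbox):
    """Return (y, x) of the top-left corner of a region box.

    Assumption: bbox is either a polygon [[x, y], ...] or a flat
    [x0, y0, x1, y1] rectangle, with Y growing downward.
    """
    if bbox and isinstance(bbox[0], (list, tuple)):
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        return min(ys), min(xs)
    return min(bbox[1], bbox[3]), min(bbox[0], bbox[2])


def reading_order(regions):
    """Sort regions top-to-bottom by Y, breaking ties left-to-right by X."""
    return sorted(regions, key=lambda region: top_left(region["bbox"]))
```

Sorting on the `(y, x)` pair keeps regions that share a baseline in left-to-right order. Multi-column pages would need more than a plain Y sort, which this sketch does not attempt.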

#### Scenario: Direct Track reflow PDF uses structured content
- **WHEN** generating reflow PDF for a Direct Track document
- **THEN** the system SHALL use `content.cells` for table rendering
- **AND** text elements SHALL use `content` string directly
- **AND** images and charts SHALL be embedded from element `saved_path`
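
One way to picture the dispatch described in this scenario is the sketch below. The element schema is assumed for illustration only: a `type` field, cells carrying `row`, `col`, and `content`, and `saved_path` sitting directly on the element are guesses, and `table_grid` and `to_flow_item` are hypothetical helper names.

```python
def table_grid(element):
    """Build a row-major grid of strings from element["content"]["cells"].

    Assumption: each cell is a dict with zero-based "row" and "col" indexes
    and a "content" value.
    """
    cells = element["content"]["cells"]
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells) + 1
    n_cols = max(cell["col"] for cell in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row"]][cell["col"]] = str(cell.get("content", ""))
    return grid


def to_flow_item(element):
    """Map one Direct Track element to a (kind, payload) pair for reflow."""
    kind = element.get("type")
    if kind == "table":
        return ("table", table_grid(element))
    if kind in ("image", "chart"):
        # Images and charts are embedded from the file named by saved_path.
        return ("image", element["saved_path"])
    # Text elements carry their content as a plain string.
    return ("text", element["content"])
```

For example, `to_flow_item({"type": "text", "content": "hello"})` yields `("text", "hello")`, which a PDF writer could then turn into a flowing paragraph.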

#### Scenario: Reflow PDF content consistency
- **WHEN** comparing Layout PDF and Reflow PDF for the same document
- **THEN** both PDFs SHALL contain the same text content
- **AND** only the presentation format SHALL differ (positioned vs flowing)
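
A consistency check along these lines could compare the text strings fed into each PDF while ignoring order and whitespace, since only the presentation is allowed to differ. The sketch below assumes the caller already has both lists of strings; `normalize` and `text_diff` are hypothetical names, not part of the project.

```python
from collections import Counter


def normalize(text):
    """Collapse runs of whitespace so layout-only differences are ignored."""
    return " ".join(text.split())


def text_diff(layout_texts, reflow_texts):
    """Compare two collections of text blocks as multisets.

    Returns the blocks present in one output but not the other; both lists
    are empty when the two PDFs carry the same text content.
    """
    layout = Counter(normalize(t) for t in layout_texts if t.strip())
    reflow = Counter(normalize(t) for t in reflow_texts if t.strip())
    return {
        "missing_in_reflow": sorted((layout - reflow).elements()),
        "extra_in_reflow": sorted((reflow - layout).elements()),
    }
```

Using a multiset rather than a set means a block that appears twice in the Layout PDF must also appear twice in the Reflow PDF.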