- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
translation Specification Delta
MODIFIED Requirements
Requirement: Translation Content Extraction
The translation service SHALL extract content based on processing track type.
Scenario: OCR Track translation extraction
- GIVEN a document processed with OCR Track
- AND the result JSON has metadata.processing_track = "ocr"
- WHEN translation service extracts translatable content
- THEN it SHALL load raw_ocr_regions.json for each page
- AND it SHALL extract all text blocks from raw OCR regions
- AND it SHALL NOT rely on content.cells from table elements
Scenario: Direct Track translation extraction (unchanged)
- GIVEN a document processed with Direct Track
- AND the result JSON has metadata.processing_track = "direct" or no track specified
- WHEN translation service extracts translatable content
- THEN it SHALL extract from pages[].elements[] in result JSON
- AND it SHALL extract table cell content from content.cells
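The two extraction scenarios above can be sketched as a single dispatch on the processing track. This is a minimal illustration, not the service's actual code: the function name, the `load_raw_regions` callable, the `text` key on regions and cells, the `element_id` and `cell_index` keys, and the choice of a running index across pages are all assumptions made for the example.

```python
from typing import Any, Callable

Region = dict[str, Any]


def extract_translatable(
    result: dict[str, Any],
    load_raw_regions: Callable[[int], list[Region]],
) -> list[dict[str, Any]]:
    """Collect translatable items, choosing the source by processing track.

    `load_raw_regions(page_number)` is a hypothetical loader that returns the
    parsed raw OCR regions for one page; how the file is located is left open.
    """
    # A missing track is treated as Direct Track, as the scenario requires.
    track = result.get("metadata", {}).get("processing_track") or "direct"
    pages = result.get("pages", [])
    items: list[dict[str, Any]] = []

    if track == "ocr":
        index = 0  # assumed: one running index over all regions of all pages
        for page_number, _page in enumerate(pages, start=1):
            for region in load_raw_regions(page_number):
                text = (region.get("text") or "").strip()
                if text:
                    items.append(
                        {"index": index, "page_number": page_number, "original": text}
                    )
                index += 1  # keep the index aligned with the region position
        return items  # content.cells of table elements is never consulted

    for page_number, page in enumerate(pages, start=1):
        for element in page.get("elements", []):
            content = element.get("content")
            if isinstance(content, dict) and "cells" in content:
                # Table element: take the text of each cell.
                for cell_index, cell in enumerate(content["cells"]):
                    text = (cell.get("text") or "").strip()
                    if text:
                        items.append(
                            {
                                "element_id": element["element_id"],
                                "cell_index": cell_index,
                                "page_number": page_number,
                                "original": text,
                            }
                        )
            elif isinstance(content, str) and content.strip():
                items.append(
                    {
                        "element_id": element["element_id"],
                        "page_number": page_number,
                        "original": content.strip(),
                    }
                )
    return items
```

Passing the loader in as a callable keeps the sketch independent of how per-page region files are named or stored, which this specification does not state.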
Requirement: Translation Result Format
The translation result JSON SHALL support both element-based and raw OCR translations.
Scenario: OCR Track translation result format
- GIVEN an OCR Track document has been translated
- WHEN translation result is saved
- THEN the JSON SHALL include raw_ocr_translations array
- AND each item SHALL have index, original, and translated fields
- AND the translations object MAY be empty or contain header text translations
Scenario: Direct Track translation result format (unchanged)
- GIVEN a Direct Track document has been translated
- WHEN translation result is saved
- THEN the JSON SHALL use translations object mapping element_id to translated text
- AND raw_ocr_translations field SHALL NOT be present
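The format rules in these two scenarios can be expressed as a small check. The sketch below is illustrative only: `check_result_format` and its message strings are hypothetical, and it assumes the result has already been parsed into a Python dict.

```python
from typing import Any

REQUIRED_ITEM_FIELDS = ("index", "original", "translated")


def check_result_format(result: dict[str, Any], track: str) -> list[str]:
    """Return a list of format problems; an empty list means the result conforms."""
    problems: list[str] = []

    if track == "ocr":
        raw = result.get("raw_ocr_translations")
        if not isinstance(raw, list):
            problems.append("raw_ocr_translations must be an array")
        else:
            for pos, item in enumerate(raw):
                if not isinstance(item, dict):
                    problems.append(f"item {pos} must be an object")
                    continue
                missing = [f for f in REQUIRED_ITEM_FIELDS if f not in item]
                if missing:
                    problems.append(f"item {pos} is missing {', '.join(missing)}")
        # translations MAY be empty or hold header text, so only its type is checked.
        if not isinstance(result.get("translations", {}), dict):
            problems.append("translations must be an object")
        return problems

    # Direct Track (or no track specified): element_id -> translated text only.
    if not isinstance(result.get("translations"), dict):
        problems.append("translations must be an object")
    if "raw_ocr_translations" in result:
        problems.append("raw_ocr_translations must not be present")
    return problems
```

For example, `check_result_format({"translations": {"e1": "x"}}, "direct")` returns an empty list, while the same call with an added `raw_ocr_translations` key reports one problem.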
Requirement: Translated PDF Generation
The translated PDF generation SHALL use appropriate translation source based on processing track.
Scenario: OCR Track translated PDF generation
- GIVEN an OCR Track document with translations
- AND the translation JSON contains raw_ocr_translations
- WHEN generating translated reflow PDF
- THEN it SHALL apply translations from raw_ocr_translations by index
- AND it SHALL render all translated text blocks in reading order
Scenario: Direct Track translated PDF generation (unchanged)
- GIVEN a Direct Track document with translations
- WHEN generating translated reflow PDF
- THEN it SHALL apply translations from translations object by element_id
- AND existing behavior SHALL be unchanged
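The lookup step of both PDF scenarios might look like the following. This is a sketch under stated assumptions rather than the generator's real interface: `apply_translations` is a hypothetical name, blocks are assumed to arrive already in reading order with a `text` key, the OCR index is assumed to be the block's position in that list, and falling back to the original text when no translation exists is a choice made for the example.

```python
from typing import Any


def apply_translations(
    track: str,
    blocks: list[dict[str, Any]],
    translation_result: dict[str, Any],
) -> list[str]:
    """Return the text to render for each block, preserving the given order."""
    if track == "ocr":
        # OCR Track: look translations up by region index.
        by_index = {
            item["index"]: item["translated"]
            for item in translation_result.get("raw_ocr_translations", [])
        }
        return [by_index.get(i, block["text"]) for i, block in enumerate(blocks)]

    # Direct Track: look translations up by element_id, as before.
    by_id = translation_result.get("translations", {})
    return [by_id.get(block["element_id"], block["text"]) for block in blocks]


# Usage: the untranslated middle region keeps its original text.
regions = [{"text": "Hello"}, {"text": "42"}, {"text": "World"}]
result = {
    "raw_ocr_translations": [
        {"index": 0, "original": "Hello", "translated": "Bonjour"},
        {"index": 2, "original": "World", "translated": "Monde"},
    ]
}
rendered = apply_translations("ocr", regions, result)
```

Keeping the Direct Track branch as a plain dictionary lookup mirrors the requirement that its existing behavior stay unchanged.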