fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,56 @@
# translation Specification Delta

## MODIFIED Requirements

### Requirement: Translation Content Extraction

The translation service SHALL extract content based on processing track type.

#### Scenario: OCR Track translation extraction
- **GIVEN** a document processed with OCR Track
- **AND** the result JSON has `metadata.processing_track = "ocr"`
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL load `raw_ocr_regions.json` for each page
- **AND** it SHALL extract all text blocks from raw OCR regions
- **AND** it SHALL NOT rely on `content.cells` from table elements

#### Scenario: Direct Track translation extraction (unchanged)
- **GIVEN** a document processed with Direct Track
- **AND** the result JSON has `metadata.processing_track = "direct"` or no track specified
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL extract from `pages[].elements[]` in result JSON
- **AND** it SHALL extract table cell content from `content.cells`
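
The two extraction scenarios above amount to a dispatch on `metadata.processing_track`. The following is a minimal sketch of that dispatch, not the service's actual code. It assumes the raw OCR regions have already been loaded from `raw_ocr_regions.json` into a dict keyed by page number, that each region carries its text under a `text` key, and that table cells carry theirs under `content`. The function name `extract_translatable`, those key names, and the per-page `index` are illustrative assumptions; only `processing_track`, `pages[].elements[]`, and `content.cells` come from the spec.

```python
from typing import Any


def extract_translatable(
    result: dict[str, Any],
    raw_regions_by_page: dict[int, list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Collect translatable items according to metadata.processing_track."""
    # A missing track is treated as Direct Track, as the Direct scenario states.
    track = result.get("metadata", {}).get("processing_track") or "direct"
    items: list[dict[str, Any]] = []

    if track == "ocr":
        # OCR Track: take text from the raw OCR regions of each page and
        # never consult content.cells of table elements.
        for page_number in sorted(raw_regions_by_page):
            for index, region in enumerate(raw_regions_by_page[page_number]):
                # The "text" key is an assumption; blank regions are skipped here.
                text = str(region.get("text", "")).strip()
                if text:
                    items.append(
                        {"page_number": page_number, "index": index, "text": text}
                    )
        return items

    # Direct Track: walk pages[].elements[], expanding table cells.
    for page_number, page in enumerate(result.get("pages", []), start=1):
        for element in page.get("elements", []):
            content = element.get("content")
            if isinstance(content, dict) and "cells" in content:
                # The per-cell "content" key is an assumption.
                texts = [str(cell.get("content", "")) for cell in content["cells"]]
            else:
                texts = [content] if isinstance(content, str) else []
            for text in texts:
                if text.strip():
                    items.append(
                        {
                            "page_number": page_number,
                            "element_id": element.get("element_id"),
                            "text": text.strip(),
                        }
                    )
    return items
```

Passing the regions in as an argument keeps the sketch independent of where `raw_ocr_regions.json` lives on disk, which the spec does not state.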

### Requirement: Translation Result Format

The translation result JSON SHALL support both element-based and raw OCR translations.

#### Scenario: OCR Track translation result format
- **GIVEN** an OCR Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL include `raw_ocr_translations` array
- **AND** each item SHALL have `index`, `original`, and `translated` fields
- **AND** the `translations` object MAY be empty or contain header text translations

#### Scenario: Direct Track translation result format (unchanged)
- **GIVEN** a Direct Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL use `translations` object mapping element_id to translated text
- **AND** `raw_ocr_translations` field SHALL NOT be present
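
The two result formats above can be checked mechanically. The sketch below is one possible way to express those rules as a validator; it is not the API validation added by this commit. The function name `validate_translation_result` and the error strings are hypothetical, while the field names `translations`, `raw_ocr_translations`, `index`, `original`, and `translated` are taken from the scenarios.

```python
from typing import Any

REQUIRED_ITEM_FIELDS = ("index", "original", "translated")


def validate_translation_result(result: dict[str, Any], track: str) -> list[str]:
    """Return a list of format violations; an empty list means the result conforms."""
    errors: list[str] = []

    if track == "ocr":
        items = result.get("raw_ocr_translations")
        if not isinstance(items, list):
            errors.append("raw_ocr_translations array is required for OCR Track")
        else:
            for pos, item in enumerate(items):
                if not isinstance(item, dict):
                    errors.append(f"raw_ocr_translations[{pos}] is not an object")
                    continue
                missing = [key for key in REQUIRED_ITEM_FIELDS if key not in item]
                if missing:
                    errors.append(
                        f"raw_ocr_translations[{pos}] is missing: {', '.join(missing)}"
                    )
        # translations MAY be empty or hold header text, so only its type is checked.
        if not isinstance(result.get("translations", {}), dict):
            errors.append("translations must be an object when present")
        return errors

    # Direct Track: element_id -> translated text, and no raw OCR field at all.
    if "raw_ocr_translations" in result:
        errors.append("raw_ocr_translations must not be present for Direct Track")
    if not isinstance(result.get("translations"), dict):
        errors.append("translations object is required for Direct Track")
    return errors
```

Returning a list of violations rather than raising on the first one makes it easier to report every problem in a saved result at once.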

### Requirement: Translated PDF Generation

The translated PDF generation SHALL use appropriate translation source based on processing track.

#### Scenario: OCR Track translated PDF generation
- **GIVEN** an OCR Track document with translations
- **AND** the translation JSON contains `raw_ocr_translations`
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `raw_ocr_translations` by index
- **AND** it SHALL render all translated text blocks in reading order

#### Scenario: Direct Track translated PDF generation (unchanged)
- **GIVEN** a Direct Track document with translations
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `translations` object by element_id
- **AND** existing behavior SHALL be unchanged
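
As an illustration of the two lookup strategies above, the sketch below produces the text blocks a reflow renderer would receive for a single page. It rests on several assumptions that the spec does not confirm: regions and elements arrive already sorted in reading order, each region has a `text` key, each element has a string `content`, and a block without a translation falls back to its original text. The function name `blocks_for_reflow` is hypothetical, and PDF rendering itself is out of scope here.

```python
from typing import Any, Optional


def blocks_for_reflow(
    track: str,
    translation_json: dict[str, Any],
    regions: Optional[list[dict[str, Any]]] = None,
    elements: Optional[list[dict[str, Any]]] = None,
) -> list[str]:
    """Return the text of each block, translated where a translation exists."""
    if track == "ocr":
        # OCR Track: match raw_ocr_translations to regions by index.
        by_index = {
            item["index"]: item["translated"]
            for item in translation_json.get("raw_ocr_translations", [])
        }
        return [
            by_index.get(index, region.get("text", ""))
            for index, region in enumerate(regions or [])
        ]

    # Direct Track: match the translations object to elements by element_id.
    by_id = translation_json.get("translations", {})
    return [
        by_id.get(element.get("element_id"), element.get("content", ""))
        for element in (elements or [])
    ]
```

A short usage example, with made-up data:

```python
ocr_json = {"raw_ocr_translations": [
    {"index": 1, "original": "World", "translated": "Monde"},
]}
print(blocks_for_reflow("ocr", ocr_json, regions=[{"text": "Hello"}, {"text": "World"}]))
# ['Hello', 'Monde']
```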