OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00


# translation Specification Delta
## MODIFIED Requirements
### Requirement: Translation Content Extraction
The translation service SHALL extract content from the source that matches the document's processing track.
#### Scenario: OCR Track translation extraction
- **GIVEN** a document processed with OCR Track
- **AND** the result JSON has `metadata.processing_track = "ocr"`
- **WHEN** the translation service extracts translatable content
- **THEN** it SHALL load `raw_ocr_regions.json` for each page
- **AND** it SHALL extract all text blocks from raw OCR regions
- **AND** it SHALL NOT rely on `content.cells` from table elements
#### Scenario: Direct Track translation extraction (unchanged)
- **GIVEN** a document processed with Direct Track
- **AND** the result JSON has `metadata.processing_track = "direct"` or no track specified
- **WHEN** the translation service extracts translatable content
- **THEN** it SHALL extract from `pages[].elements[]` in result JSON
- **AND** it SHALL extract table cell content from `content.cells`
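The two extraction scenarios above can be sketched as a single dispatch on `metadata.processing_track`. This is a minimal illustration, not the project's implementation: the function name `extract_translatable`, the caller-supplied `load_regions` loader (which hides how each page's `raw_ocr_regions.json` is located), the `text` key on regions and cells, and the global running `index` are all assumptions.

```python
from typing import Any, Callable

# Hypothetical loader type: given a 1-based page number, return that page's
# parsed raw_ocr_regions.json content as a list of region dicts.
RegionLoader = Callable[[int], list[dict[str, Any]]]


def extract_translatable(
    result: dict[str, Any], load_regions: RegionLoader
) -> list[dict[str, Any]]:
    """Collect translatable text according to metadata.processing_track."""
    # A missing track is treated as Direct Track, as the spec requires.
    track = result.get("metadata", {}).get("processing_track") or "direct"
    pages = result.get("pages", [])
    items: list[dict[str, Any]] = []

    if track == "ocr":
        # OCR Track: read raw OCR regions per page and ignore content.cells.
        for page_number in range(1, len(pages) + 1):
            for region in load_regions(page_number):
                text = (region.get("text") or "").strip()  # "text" key is assumed
                if text:
                    items.append(
                        {"index": len(items), "page_number": page_number, "original": text}
                    )
        return items

    # Direct Track: walk pages[].elements[], including table cells.
    for page_number, page in enumerate(pages, start=1):
        for element in page.get("elements", []):
            content = element.get("content")
            if isinstance(content, dict):
                # Table element: one item per cell ("text" on a cell is assumed).
                texts = [cell.get("text", "") for cell in content.get("cells", [])]
            else:
                texts = [content or ""]
            for text in texts:
                if text.strip():
                    items.append(
                        {
                            "element_id": element.get("element_id"),
                            "page_number": page_number,
                            "original": text.strip(),
                        }
                    )
    return items
```

Passing the loader in as a parameter keeps the sketch independent of any particular file layout and makes the OCR branch easy to exercise with in-memory data.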
### Requirement: Translation Result Format
The translation result JSON SHALL support both element-based and raw OCR translations.
#### Scenario: OCR Track translation result format
- **GIVEN** an OCR Track document has been translated
- **WHEN** the translation result is saved
- **THEN** the JSON SHALL include a `raw_ocr_translations` array
- **AND** each item SHALL have `index`, `original`, and `translated` fields
- **AND** the `translations` object MAY be empty or contain header text translations
#### Scenario: Direct Track translation result format (unchanged)
- **GIVEN** a Direct Track document has been translated
- **WHEN** the translation result is saved
- **THEN** the JSON SHALL use a `translations` object mapping each element_id to its translated text
- **AND** the `raw_ocr_translations` field SHALL NOT be present
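The two result shapes above can be illustrated with a small builder and a matching validator. This is a sketch under stated assumptions: `build_translation_result` and `validate_translation_result` are hypothetical names, and the input item dicts (with `index`, `original`, `translated`, or `element_id`) are invented for the example rather than taken from the project.

```python
from typing import Any


def build_translation_result(
    track: str, items: list[dict[str, Any]]
) -> dict[str, Any]:
    """Shape translated items into the result JSON for the given track."""
    if track == "ocr":
        return {
            # MAY stay empty or carry header text translations.
            "translations": {},
            "raw_ocr_translations": [
                {
                    "index": item["index"],
                    "original": item["original"],
                    "translated": item["translated"],
                }
                for item in items
            ],
        }
    # Direct Track: element_id -> translated text, no raw_ocr_translations key.
    return {"translations": {item["element_id"]: item["translated"] for item in items}}


def validate_translation_result(track: str, result: dict[str, Any]) -> list[str]:
    """Return a list of spec violations; an empty list means the shape is valid."""
    errors: list[str] = []
    if track == "ocr":
        raw = result.get("raw_ocr_translations")
        if not isinstance(raw, list):
            errors.append("OCR Track result needs a raw_ocr_translations array")
        else:
            for pos, item in enumerate(raw):
                missing = {"index", "original", "translated"} - set(item)
                if missing:
                    errors.append(f"item {pos} is missing {sorted(missing)}")
    else:
        if "raw_ocr_translations" in result:
            errors.append("Direct Track result must not contain raw_ocr_translations")
        if not isinstance(result.get("translations"), dict):
            errors.append("Direct Track result needs a translations object")
    return errors
```

Returning a list of violations instead of raising lets a caller report every problem in one pass; whether the real API validation behaves this way is not stated in the spec.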
### Requirement: Translated PDF Generation
Translated PDF generation SHALL use the translation source that matches the document's processing track.
#### Scenario: OCR Track translated PDF generation
- **GIVEN** an OCR Track document with translations
- **AND** the translation JSON contains `raw_ocr_translations`
- **WHEN** generating the translated reflow PDF
- **THEN** it SHALL apply translations from `raw_ocr_translations` by index
- **AND** it SHALL render all translated text blocks in reading order
#### Scenario: Direct Track translated PDF generation (unchanged)
- **GIVEN** a Direct Track document with translations
- **WHEN** generating the translated reflow PDF
- **THEN** it SHALL apply translations from the `translations` object by element_id
- **AND** existing behavior SHALL be unchanged
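The lookup step of both PDF scenarios can be sketched as follows. The function name `apply_translations`, the block shape (`text`, optional `element_id`), the choice to treat a block's position in the reading-order list as its `index`, and the fallback to the original text when no translation exists are all assumptions made for illustration; actual PDF rendering is out of scope here.

```python
from typing import Any


def apply_translations(
    track: str, blocks: list[dict[str, Any]], translation_json: dict[str, Any]
) -> list[str]:
    """Return the text to render for each block, in the order blocks are given."""
    if track == "ocr" and "raw_ocr_translations" in translation_json:
        by_index = {
            item["index"]: item["translated"]
            for item in translation_json["raw_ocr_translations"]
        }
        # Assumption: a block's index is its position in the reading-order list.
        return [by_index.get(i, block["text"]) for i, block in enumerate(blocks)]

    # Direct Track: look up each block by element_id in the translations object.
    by_id = translation_json.get("translations", {})
    return [by_id.get(block.get("element_id"), block["text"]) for block in blocks]
```

For example, calling it with `track="ocr"`, two blocks, and a `raw_ocr_translations` array that only covers index 0 yields the translated text for the first block and the untouched original for the second.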