OCR/openspec/changes/archive/2025-12-12-fix-ocr-track-translation/specs/translation/spec.md
egg 1f18010040 fix: OCR Track reflow PDF and translation with image text filtering
- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00


translation Specification Delta

MODIFIED Requirements

Requirement: Translation Content Extraction

The translation service SHALL extract content based on processing track type.

Scenario: OCR Track translation extraction

  • GIVEN a document processed with OCR Track
  • AND the result JSON has metadata.processing_track = "ocr"
  • WHEN the translation service extracts translatable content
  • THEN it SHALL load raw_ocr_regions.json for each page
  • AND it SHALL extract all text blocks from raw OCR regions
  • AND it SHALL NOT rely on content.cells from table elements
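The OCR Track scenario above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function name `extract_ocr_track_items`, the `load_regions` callback, the per-region `text` key, and the choice of a document-wide running `index` are all hypothetical. Only `metadata.processing_track`, `raw_ocr_regions.json`, and the `index`/`original`/`page_number` field names come from the spec and commit message.

```python
from typing import Callable, Dict, List


def extract_ocr_track_items(
    result: dict,
    load_regions: Callable[[int], List[dict]],
) -> List[Dict[str, object]]:
    """Collect translatable text blocks from raw OCR regions, page by page.

    `load_regions(page_number)` is a hypothetical callback returning the
    parsed raw_ocr_regions.json content for one page (1-based). The "text"
    key on each region is an assumption about that file's schema.
    """
    if result.get("metadata", {}).get("processing_track") != "ocr":
        raise ValueError("not an OCR Track result")

    items: List[Dict[str, object]] = []
    index = 0  # assumed: running counter over every region in the document
    for page_number, _page in enumerate(result.get("pages", []), start=1):
        # pages[].elements[] (including content.cells) is deliberately ignored.
        for region in load_regions(page_number):
            text = (region.get("text") or "").strip()
            if text:
                items.append(
                    {"page_number": page_number, "index": index, "original": text}
                )
            index += 1  # empty regions still consume an index, keeping it stable
    return items
```

Passing the loader in as a callback keeps the sketch independent of where the per-page region files actually live on disk, which the spec does not state.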

Scenario: Direct Track translation extraction (unchanged)

  • GIVEN a document processed with Direct Track
  • AND the result JSON has metadata.processing_track = "direct", or specifies no track
  • WHEN the translation service extracts translatable content
  • THEN it SHALL extract from pages[].elements[] in result JSON
  • AND it SHALL extract table cell content from content.cells
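For comparison, Direct Track extraction might look like the sketch below. The element schema is assumed for illustration only: each element is taken to have an `element_id` and a `content` that is either a string or, for tables, a dict whose `cells` list holds dicts with `row`, `col`, and `text`. The composite `element_id:row:col` key for cells is likewise hypothetical.

```python
from typing import Dict


def extract_direct_track_items(result: dict) -> Dict[str, str]:
    """Map an identifier to translatable text for a Direct Track result.

    A missing or empty processing_track is treated as "direct", matching
    the "no track specified" case in the scenario.
    """
    track = result.get("metadata", {}).get("processing_track") or "direct"
    if track != "direct":
        raise ValueError(f"expected Direct Track, got {track!r}")

    texts: Dict[str, str] = {}
    for page in result.get("pages", []):
        for element in page.get("elements", []):
            element_id = element["element_id"]
            content = element.get("content")
            if isinstance(content, str):
                # Plain text element: keep it if it is not blank.
                if content.strip():
                    texts[element_id] = content
            elif isinstance(content, dict):
                # Table element: pull text from content.cells (assumed shape).
                for cell in content.get("cells", []):
                    text = (cell.get("text") or "").strip()
                    if text:
                        key = f"{element_id}:{cell['row']}:{cell['col']}"
                        texts[key] = text
    return texts
```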

Requirement: Translation Result Format

The translation result JSON SHALL support both element-based and raw OCR translations.

Scenario: OCR Track translation result format

  • GIVEN an OCR Track document has been translated
  • WHEN the translation result is saved
  • THEN the JSON SHALL include raw_ocr_translations array
  • AND each item SHALL have index, original, and translated fields
  • AND the translations object MAY be empty or contain header text translations

Scenario: Direct Track translation result format (unchanged)

  • GIVEN a Direct Track document has been translated
  • WHEN the translation result is saved
  • THEN the JSON SHALL use translations object mapping element_id to translated text
  • AND raw_ocr_translations field SHALL NOT be present
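The two result shapes can be illustrated with a small builder and validator. Both function names are hypothetical, and the validation rule shown is only one plausible reading of "accept both translations and raw_ocr_translations" from the commit message. The field names `translations`, `raw_ocr_translations`, `index`, `original`, `translated`, and `page_number` are taken from the spec and commit message.

```python
from typing import Dict, List, Optional

REQUIRED_RAW_FIELDS = {"index", "original", "translated"}


def build_translation_result(
    track: str,
    translations: Optional[Dict[str, str]] = None,
    raw_items: Optional[List[dict]] = None,
) -> dict:
    """Assemble a translation result dict for the given processing track."""
    result: dict = {"translations": dict(translations or {})}
    if track == "ocr":
        entries = []
        for item in raw_items or []:
            entry = {key: item[key] for key in ("index", "original", "translated")}
            if "page_number" in item:  # optional tracking field
                entry["page_number"] = item["page_number"]
            entries.append(entry)
        result["raw_ocr_translations"] = entries
    # Direct Track: raw_ocr_translations is intentionally left out.
    return result


def is_valid_translation_result(data: dict) -> bool:
    """Accept either raw OCR translations or an element-based mapping."""
    raw = data.get("raw_ocr_translations")
    if raw is not None:
        if not isinstance(raw, list):
            return False
        return all(
            isinstance(item, dict) and REQUIRED_RAW_FIELDS <= item.keys()
            for item in raw
        )
    return isinstance(data.get("translations"), dict)
```

For an OCR Track document the `translations` object is still emitted, possibly empty, so consumers that only know the element-based format do not fail on a missing key.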

Requirement: Translated PDF Generation

Translated PDF generation SHALL use the translation source appropriate to the processing track.

Scenario: OCR Track translated PDF generation

  • GIVEN an OCR Track document with translations
  • AND the translation JSON contains raw_ocr_translations
  • WHEN generating translated reflow PDF
  • THEN it SHALL apply translations from raw_ocr_translations by index
  • AND it SHALL render all translated text blocks in reading order

Scenario: Direct Track translated PDF generation (unchanged)

  • GIVEN a Direct Track document with translations
  • WHEN generating translated reflow PDF
  • THEN it SHALL apply translations from translations object by element_id
  • AND existing behavior SHALL be unchanged
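The source selection described in this requirement can be sketched as a single lookup step before rendering. The names here are hypothetical: `apply_translations` and the block shapes (`index` and `text` for OCR Track, `element_id` and `text` for Direct Track) are assumptions, as is falling back to the original text when no translation exists. The blocks are assumed to arrive already sorted in reading order.

```python
from typing import List


def apply_translations(
    track: str, blocks: List[dict], translation_json: dict
) -> List[str]:
    """Return the text to render for each block, preserving input order."""
    raw = translation_json.get("raw_ocr_translations")
    if track == "ocr" and raw is not None:
        # OCR Track: look up each block by its raw OCR region index.
        by_index = {item["index"]: item["translated"] for item in raw}
        return [by_index.get(block["index"], block["text"]) for block in blocks]

    # Direct Track (or no raw translations): look up by element_id.
    by_id = translation_json.get("translations", {})
    return [by_id.get(block["element_id"], block["text"]) for block in blocks]
```

Keeping the Direct Track branch as the default path means a translation file without `raw_ocr_translations` is handled exactly as before.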