fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 11:02:35 +08:00
parent 24253ac15e
commit 1f18010040
11 changed files with 1040 additions and 149 deletions
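
The exclusion zone filtering and Y-coordinate ordering described in the commit message can be illustrated with a minimal sketch. This is not the code from this commit: the region shape (`text`, `bbox` as `[x0, y0, x1, y1]` with a top-left origin), the helper names `overlap_ratio` and `filter_and_order`, and the 0.5 overlap threshold are all assumptions made for illustration, since the actual schema of `raw_ocr_regions.json` is not shown here.

```python
from typing import Dict, List, Sequence

# Hypothetical box layout: (x0, y0, x1, y1), origin at the top-left corner.
Box = Sequence[float]


def overlap_ratio(text_box: Box, zone: Box) -> float:
    """Return the fraction of text_box's area that lies inside zone."""
    inter_w = min(text_box[2], zone[2]) - max(text_box[0], zone[0])
    inter_h = min(text_box[3], zone[3]) - max(text_box[1], zone[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    return (inter_w * inter_h) / area if area > 0 else 0.0


def filter_and_order(
    regions: List[Dict],
    exclusion_zones: List[Box],
    threshold: float = 0.5,  # assumed cut-off, not taken from the commit
) -> List[Dict]:
    """Drop regions mostly covered by an image zone, then sort top to bottom."""
    kept = [
        r for r in regions
        if all(overlap_ratio(r["bbox"], z) < threshold for z in exclusion_zones)
    ]
    # Sort by top edge (Y), then left edge (X), to approximate reading order.
    return sorted(kept, key=lambda r: (r["bbox"][1], r["bbox"][0]))


if __name__ == "__main__":
    regions = [
        {"text": "second line", "bbox": [0, 100, 50, 120]},
        {"text": "label inside image", "bbox": [10, 10, 40, 20]},
        {"text": "first line", "bbox": [0, 0, 50, 8]},
    ]
    image_zones = [[5, 5, 60, 30]]
    for region in filter_and_order(regions, image_zones):
        print(region["text"])
```

Measuring overlap relative to the text box, rather than as intersection over union, means a small label sitting inside a large image is still excluded. A pure Y sort is only a rough reading order for single-column pages; multi-column layouts would need additional grouping.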
--- a/openspec/specs/result-export/spec.md
+++ b/openspec/specs/result-export/spec.md
@@ -58,36 +58,23 @@ Export settings (format, thresholds, templates) SHALL apply consistently to V2 t

 The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.

-#### Scenario: Export PDF from direct extraction track
- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the system SHALL render source PDF pages as full-page background images at 2x resolution
- **AND** overlay invisible text elements using PDF Text Rendering Mode 3
- **AND** text SHALL remain selectable and searchable despite being invisible
- **AND** visual output SHALL match source document exactly
+#### Scenario: OCR Track reflow PDF uses raw OCR regions
+- **WHEN** generating reflow PDF for an OCR Track document
+- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
+- **AND** text blocks SHALL be sorted by Y coordinate for reading order
+- **AND** all text content SHALL match the Layout PDF output
+- **AND** images and charts SHALL be embedded from element `saved_path`

-#### Scenario: Export PDF from OCR track with full structure
- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list
+#### Scenario: Direct Track reflow PDF uses structured content
+- **WHEN** generating reflow PDF for a Direct Track document
+- **THEN** the system SHALL use `content.cells` for table rendering
+- **AND** text elements SHALL use `content` string directly
+- **AND** images and charts SHALL be embedded from element `saved_path`

-#### Scenario: Handle coordinate transformations correctly
- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL use explicit page dimensions from OCR results (not inferred from bounding boxes)
- **AND** correctly transform Y-axis coordinates from top-left (OCR) to bottom-left (PDF/ReportLab) origin
- **AND** prevent vertical flipping or position misalignment errors
-
-#### Scenario: Direct Track PDF file size increase
- **WHEN** generating Layout PDF for Direct Track documents
- **THEN** the system SHALL accept increased file size due to embedded page images
- **AND** approximately 1-2 MB per page at 2x resolution is expected
- **AND** this trade-off is accepted for improved visual fidelity
-
-#### Scenario: Chart elements excluded from text layer
- **WHEN** generating Layout PDF containing charts
- **THEN** the system SHALL NOT include chart-internal text in the invisible text layer
- **AND** chart visuals SHALL be preserved in the background image
- **AND** chart text SHALL NOT be available for text selection or translation
+#### Scenario: Reflow PDF content consistency
+- **WHEN** comparing Layout PDF and Reflow PDF for the same document
+- **THEN** both PDFs SHALL contain the same text content
+- **AND** only the presentation format SHALL differ (positioned vs flowing)

 ### Requirement: Structure Data Export
 The system SHALL provide export formats that preserve document structure for downstream processing.