fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,51 @@
# Change: Fix OCR Track Reflow PDF

## Why

The OCR Track reflow PDF generation is missing most content because:

1. PP-StructureV3 extracts tables as elements but stores `content: ""` (empty string) instead of structured `content.cells` data
2. The `generate_reflow_pdf` method expects `content.cells` for tables, so tables are skipped
3. Table text exists in `raw_ocr_regions.json` (59 text blocks) but is not used by reflow PDF generation
4. This causes significant content loss - only 6 text elements vs 59 raw OCR regions

The Layout PDF works correctly because it uses `raw_ocr_regions.json` via Simple Text Positioning mode, bypassing the need for structured table data.

## What Changes

### Reflow PDF Generation for OCR Track

Modify `generate_reflow_pdf` to use `raw_ocr_regions.json` as the primary text source for OCR Track documents:

1. **Detect processing track** from JSON metadata
2. **For OCR Track**: Load `raw_ocr_regions.json` and render all text blocks in reading order
3. **For Direct Track**: Continue using `content.cells` for tables (already works)
4. **Images/Charts**: Continue using `content.saved_path` from elements (works for both tracks)
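
The track detection in step 1 could look roughly like the sketch below. The function name `detect_processing_track` is hypothetical, and treating a missing or unrecognized value as Direct Track is an assumption based on the spec wording ("`direct` or no track specified"), not a copy of the real method.

```python
from typing import Any, Dict


def detect_processing_track(result_json: Dict[str, Any]) -> str:
    """Return "ocr" or "direct" based on metadata.processing_track.

    Assumption: anything other than an explicit "ocr" value (missing
    metadata, None, other strings) is treated as Direct Track.
    """
    metadata = result_json.get("metadata") or {}
    track = str(metadata.get("processing_track") or "").strip().lower()
    return "ocr" if track == "ocr" else "direct"
```

Defaulting to Direct Track keeps older result files, which may lack the field, on the existing code path.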

### Data Flow

**OCR Track Reflow PDF (NEW):**
```
raw_ocr_regions.json (59 text blocks)
  + scan_result.json (images/charts only)
  → Sort by Y coordinate (reading order)
  → Render text paragraphs + images
```
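
The "sort by Y coordinate" step can be sketched as follows. The region shape is an assumption: each region is taken to be a dict with a `bbox` that is either a flat `[x0, y0, x1, y1]` list or a polygon of `[x, y]` points, and the `text` key is illustrative. Breaking ties on X is also an assumption; the proposal only mentions Y.

```python
from typing import Any, Dict, List, Sequence, Tuple


def top_left(bbox: Sequence[Any]) -> Tuple[float, float]:
    """Return (y, x) of the top-left corner for either assumed bbox shape."""
    if bbox and isinstance(bbox[0], (list, tuple)):
        # Polygon form: [[x, y], [x, y], ...]
        xs = [float(point[0]) for point in bbox]
        ys = [float(point[1]) for point in bbox]
        return min(ys), min(xs)
    # Flat form: [x0, y0, x1, y1]
    x0, y0, x1, y1 = (float(value) for value in bbox[:4])
    return min(y0, y1), min(x0, x1)


def sort_reading_order(regions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order regions top to bottom, then left to right for equal Y."""
    return sorted(regions, key=lambda region: top_left(region["bbox"]))
```

A plain Y sort is simple but will interleave columns in multi-column layouts; that limitation is inherent to the approach described here.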

**Direct Track Reflow PDF (UNCHANGED):**
```
*_result.json (elements with content.cells)
  → Render tables, text, images in order
```

## Impact

- **Affected file**: `backend/app/services/pdf_generator_service.py`
- **User experience**: OCR Track reflow PDF will contain all text content (matching Layout PDF)
- **Translation**: Reflow translated PDF will also work correctly for OCR Track

## Migration

- No data migration required
- Existing `raw_ocr_regions.json` files contain all necessary data
- No API changes
@@ -0,0 +1,23 @@
## MODIFIED Requirements

### Requirement: Enhanced PDF Export with Layout Preservation

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.

#### Scenario: OCR Track reflow PDF uses raw OCR regions
- **WHEN** generating reflow PDF for an OCR Track document
- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
- **AND** text blocks SHALL be sorted by Y coordinate for reading order
- **AND** all text content SHALL match the Layout PDF output
- **AND** images and charts SHALL be embedded from element `saved_path`

#### Scenario: Direct Track reflow PDF uses structured content
- **WHEN** generating reflow PDF for a Direct Track document
- **THEN** the system SHALL use `content.cells` for table rendering
- **AND** text elements SHALL use `content` string directly
- **AND** images and charts SHALL be embedded from element `saved_path`

#### Scenario: Reflow PDF content consistency
- **WHEN** comparing Layout PDF and Reflow PDF for the same document
- **THEN** both PDFs SHALL contain the same text content
- **AND** only the presentation format SHALL differ (positioned vs flowing)
@@ -0,0 +1,51 @@
# Tasks: Fix OCR Track Reflow PDF

## 1. Modify generate_reflow_pdf Method

- [x] 1.1 Add processing track detection
  - File: `backend/app/services/pdf_generator_service.py`
  - Location: `generate_reflow_pdf` method (line ~4704)
  - Read `metadata.processing_track` from JSON data
  - Branch logic based on track type

- [x] 1.2 Add helper function to load raw OCR regions
  - File: `backend/app/services/pdf_generator_service.py`
  - Using existing: `load_raw_ocr_regions` from `text_region_renderer.py`
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Return: List of text regions with bbox and content

- [x] 1.3 Implement OCR Track reflow rendering
  - File: `backend/app/services/pdf_generator_service.py`
  - For OCR Track: Load raw OCR regions per page
  - Sort text blocks by Y coordinate (top to bottom reading order)
  - Render text blocks as paragraphs
  - Still render images/charts from elements

- [x] 1.4 Keep Direct Track logic unchanged
  - File: `backend/app/services/pdf_generator_service.py`
  - Direct Track continues using `content.cells` for tables
  - Extracted to `_render_reflow_elements` helper method
  - No changes to existing Direct Track flow

## 2. Handle Multi-page Documents

- [x] 2.1 Support per-page raw OCR files
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Iterate through pages and load corresponding raw OCR file
  - Handle missing files gracefully (fall back to elements)
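
A minimal sketch of the per-page lookup and graceful fallback, using the file pattern above. This is not the existing `load_raw_ocr_regions` helper; `find_raw_ocr_file` and `load_page_regions` are hypothetical names, and the assumption that each file holds a top-level JSON list is not confirmed by the task list.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


def find_raw_ocr_file(result_dir: Path, task_id: str, page_num: int) -> Optional[Path]:
    """Locate the raw OCR file for one page, or None if it does not exist."""
    pattern = f"{task_id}_*_page_{page_num}_raw_ocr_regions.json"
    matches = sorted(result_dir.glob(pattern))
    return matches[0] if matches else None


def load_page_regions(
    result_dir: Path, task_id: str, page_num: int
) -> Optional[List[Dict[str, Any]]]:
    """Return the regions for a page, or None so the caller can fall back to elements."""
    path = find_raw_ocr_file(result_dir, task_id, page_num)
    if path is None:
        return None
    try:
        with path.open(encoding="utf-8") as handle:
            data = json.load(handle)
    except (OSError, json.JSONDecodeError):
        # Unreadable or corrupt file: treat it the same as a missing file.
        return None
    # Assumption: the file is a JSON list of region objects.
    return data if isinstance(data, list) else None
```

Returning `None` rather than an empty list lets the caller distinguish "no raw OCR file, use elements instead" from "file present but page has no text".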

## 3. Testing

- [x] 3.1 Test OCR Track reflow PDF
  - Test with: `a9259180-fc49-4890-8184-2e6d5f4edad3` (scan document)
  - Verify: All 59 text blocks appear in reflow PDF
  - Verify: Images are embedded correctly

- [x] 3.2 Test Direct Track reflow PDF
  - Test with: `1b32428d-0609-4cfd-bc52-56be6956ac2e` (editable PDF)
  - Verify: Tables render with cells
  - Verify: No regression from changes

- [x] 3.3 Test translated reflow PDF
  - Test: Complete translation then download reflow PDF
  - Verify: Translated text appears correctly
@@ -0,0 +1,70 @@
# Change: Fix OCR Track Translation

## Why

OCR Track translation is missing most content because:

1. Translation service (`extract_translatable_elements`) only processes elements from `scan_result.json`
2. OCR Track tables have `content: ""` (empty string) - no `content.cells` data
3. All table text exists in `raw_ocr_regions.json` (59 text blocks) but translation service ignores it
4. Result: Only 6 text elements translated vs 59 raw OCR regions available

**Current Data Flow (OCR Track):**
```
scan_result.json (10 elements, 6 text, 2 empty tables)
  → Translation extracts 6 text items
  → 53 text blocks in tables are NOT translated
```

**Expected Data Flow (OCR Track):**
```
raw_ocr_regions.json (59 text blocks)
  → Translation extracts ALL 59 text items
  → Complete translation coverage
```

## What Changes

### 1. Translation Service Enhancement

Modify `translate_document` in `translation_service.py` to:

1. **Detect processing track** from result JSON metadata
2. **For OCR Track**: Load and translate `raw_ocr_regions.json` instead of elements
3. **For Direct Track**: Continue using elements with `content.cells` (already works)

### 2. Translation Result Format for OCR Track

Add new field `raw_ocr_translations` to translation JSON for OCR Track:

```json
{
  "translations": { ... },  // element-based (for Direct Track)
  "raw_ocr_translations": [  // NEW: for OCR Track
    {
      "index": 0,
      "original": "华天科技(宝鸡)有限公司",
      "translated": "Huatian Technology (Baoji) Co., Ltd."
    },
    ...
  ]
}
```

### 3. Translated PDF Generation

Modify `generate_translated_pdf` to use `raw_ocr_translations` when available for OCR Track documents.

## Impact

- **Affected files**:
  - `backend/app/services/translation_service.py` - extraction and translation logic
  - `backend/app/services/pdf_generator_service.py` - translated PDF rendering
- **User experience**: OCR Track translations will include ALL text content
- **API**: Translation JSON format extended (backward compatible)

## Migration

- No data migration required
- Existing translations continue to work (Direct Track unaffected)
- Re-translation needed for OCR Track documents to get full coverage
@@ -0,0 +1,56 @@
# translation Specification Delta

## MODIFIED Requirements

### Requirement: Translation Content Extraction

The translation service SHALL extract content based on processing track type.

#### Scenario: OCR Track translation extraction
- **GIVEN** a document processed with OCR Track
- **AND** the result JSON has `metadata.processing_track = "ocr"`
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL load `raw_ocr_regions.json` for each page
- **AND** it SHALL extract all text blocks from raw OCR regions
- **AND** it SHALL NOT rely on `content.cells` from table elements

#### Scenario: Direct Track translation extraction (unchanged)
- **GIVEN** a document processed with Direct Track
- **AND** the result JSON has `metadata.processing_track = "direct"` or no track specified
- **WHEN** translation service extracts translatable content
- **THEN** it SHALL extract from `pages[].elements[]` in result JSON
- **AND** it SHALL extract table cell content from `content.cells`

### Requirement: Translation Result Format

The translation result JSON SHALL support both element-based and raw OCR translations.

#### Scenario: OCR Track translation result format
- **GIVEN** an OCR Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL include `raw_ocr_translations` array
- **AND** each item SHALL have `index`, `original`, and `translated` fields
- **AND** the `translations` object MAY be empty or contain header text translations

#### Scenario: Direct Track translation result format (unchanged)
- **GIVEN** a Direct Track document has been translated
- **WHEN** translation result is saved
- **THEN** the JSON SHALL use `translations` object mapping element_id to translated text
- **AND** `raw_ocr_translations` field SHALL NOT be present
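
Because the `translations` object may legitimately be empty for OCR Track, any consumer that checks whether a translation file is usable has to accept either field. A minimal sketch of such a check, with the hypothetical name `has_translation_content`:

```python
from typing import Any, Dict


def has_translation_content(translation_json: Dict[str, Any]) -> bool:
    """True when the element-based map or the raw OCR list is non-empty.

    Direct Track results are expected to populate "translations";
    OCR Track results are expected to populate "raw_ocr_translations".
    """
    translations = translation_json.get("translations")
    raw_ocr = translation_json.get("raw_ocr_translations")
    return bool(translations) or bool(raw_ocr)
```

A check that looks only at `translations` would reject valid OCR Track results whose `translations` object is empty.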

### Requirement: Translated PDF Generation

The translated PDF generation SHALL use appropriate translation source based on processing track.

#### Scenario: OCR Track translated PDF generation
- **GIVEN** an OCR Track document with translations
- **AND** the translation JSON contains `raw_ocr_translations`
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `raw_ocr_translations` by index
- **AND** it SHALL render all translated text blocks in reading order

#### Scenario: Direct Track translated PDF generation (unchanged)
- **GIVEN** a Direct Track document with translations
- **WHEN** generating translated reflow PDF
- **THEN** it SHALL apply translations from `translations` object by element_id
- **AND** existing behavior SHALL be unchanged
@@ -0,0 +1,76 @@
# Tasks: Fix OCR Track Translation

## 1. Modify Translation Service

- [x] 1.1 Add processing track detection
  - File: `backend/app/services/translation_service.py`
  - Location: `translate_document` method
  - Read `metadata.processing_track` from result JSON
  - Pass track type to extraction method

- [x] 1.2 Create helper to load raw OCR regions
  - File: `backend/app/services/translation_service.py`
  - Function: `_load_raw_ocr_regions(result_dir, task_id, page_num)`
  - Pattern: `{task_id}_*_page_{page_num}_raw_ocr_regions.json`
  - Return: List of text regions with index and content

- [x] 1.3 Modify extract_translatable_elements for OCR Track
  - File: `backend/app/services/translation_service.py`
  - Added: `extract_translatable_elements_ocr_track` method
  - Added parameters: `result_dir: Path`, `task_id: str`
  - For OCR Track: Extract from raw_ocr_regions.json
  - For Direct Track: Keep existing element-based extraction

- [x] 1.4 Update translation result format
  - File: `backend/app/services/translation_service.py`
  - Location: `build_translation_result` method
  - Added `processing_track` parameter
  - For OCR Track: Output `raw_ocr_translations` field
  - Structure: `[{"page": 1, "index": 0, "original": "...", "translated": "..."}]`
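
The result structure in 1.4 could be assembled along these lines. `TranslatedRegion` is a hypothetical, reduced stand-in for the real translated-item schema (only the idea of a page number defaulting to 1 comes from this change), and `build_raw_ocr_translations` is an illustrative name rather than part of `build_translation_result`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class TranslatedRegion:
    """Hypothetical stand-in for one translated raw OCR text block."""
    index: int
    original: str
    translated: str
    page_number: int = 1


def build_raw_ocr_translations(items: List[TranslatedRegion]) -> List[Dict[str, Any]]:
    """Serialize items into the list-of-dicts structure, ordered by page then index."""
    ordered = sorted(items, key=lambda item: (item.page_number, item.index))
    return [
        {
            "page": item.page_number,
            "index": item.index,
            "original": item.original,
            "translated": item.translated,
        }
        for item in ordered
    ]
```

Carrying the page number on every entry matters because `index` restarts on each page, so `index` alone cannot identify a block in a multi-page document.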

## 2. Modify PDF Generation

- [x] 2.1 Update generate_translated_pdf for OCR Track
  - File: `backend/app/services/pdf_generator_service.py`
  - Detect `processing_track` and `raw_ocr_translations` from translation JSON
  - For OCR Track: Call `_generate_translated_pdf_ocr_track`
  - For Direct Track: Continue using `apply_translations` (element-based)

- [x] 2.2 Create helper to apply raw OCR translations
  - File: `backend/app/services/pdf_generator_service.py`
  - Function: `_generate_translated_pdf_ocr_track`
  - Build translation lookup: `{(page, index): translated_text}`
  - Load raw OCR regions, sort by Y coordinate
  - Render translated text with original fallback
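
The lookup and fallback in 2.2 can be sketched as below. Two assumptions are made: `index` is taken to be the region's position in the raw OCR file before any sorting, and each region's source text is assumed to live under a `text` key. Both function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def build_lookup(raw_ocr_translations: List[Dict[str, Any]]) -> Dict[Tuple[int, int], str]:
    """Map (page, index) to translated text, skipping empty translations."""
    lookup: Dict[Tuple[int, int], str] = {}
    for item in raw_ocr_translations:
        translated = item.get("translated")
        if translated:
            # Entries without a "page" field are assumed to belong to page 1.
            lookup[(int(item.get("page", 1)), int(item["index"]))] = translated
    return lookup


def texts_for_page(
    regions: List[Dict[str, Any]], page: int, lookup: Dict[Tuple[int, int], str]
) -> List[str]:
    """Return one string per region: the translation if present, else the original."""
    return [
        lookup.get((page, position), region.get("text", ""))
        for position, region in enumerate(regions)
    ]
```

Under these assumptions the translation has to be attached to each region before the Y-coordinate sort, since sorting changes positions and would break the index match.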

## 3. Additional Fixes

- [x] 3.1 Add page_number to TranslatedItem
  - File: `backend/app/schemas/translation.py`
  - Added `page_number: int = 1` to TranslatedItem dataclass
  - Updated `translate_batch` and `translate_item` to pass page_number

- [x] 3.2 Update API endpoint validation
  - File: `backend/app/routers/translate.py`
  - Check for both `translations` (Direct Track) and `raw_ocr_translations` (OCR Track)

- [x] 3.3 Filter text overlapping with images
  - File: `backend/app/services/pdf_generator_service.py`
  - Added `_collect_exclusion_zones`, `_is_region_overlapping_exclusion`, `_filter_regions_by_exclusion`
  - Applied filtering in `generate_reflow_pdf` and `_generate_translated_pdf_ocr_track`
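
One way the exclusion-zone filtering in 3.3 might work is an area-overlap test between each text region and each image box. The sketch below is not the code of the helpers named above: the function names, the flat `[x0, y0, x1, y1]` bbox shape, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed shape


def overlap_ratio(region: Sequence[float], zone: Sequence[float]) -> float:
    """Fraction of the region's area that lies inside the zone."""
    rx0, ry0, rx1, ry1 = region
    zx0, zy0, zx1, zy1 = zone
    width = min(rx1, zx1) - max(rx0, zx0)
    height = min(ry1, zy1) - max(ry0, zy0)
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not intersect
    area = (rx1 - rx0) * (ry1 - ry0)
    return (width * height) / area if area > 0 else 0.0


def filter_regions(
    regions: List[Dict[str, Any]], zones: List[Box], threshold: float = 0.5
) -> List[Dict[str, Any]]:
    """Keep regions whose overlap with every exclusion zone is below the threshold."""
    return [
        region
        for region in regions
        if all(overlap_ratio(region["bbox"], zone) < threshold for zone in zones)
    ]
```

Measuring overlap relative to the text region's own area means a small label fully inside a large image is dropped, while a long caption that only clips the image edge is kept.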

## 4. Testing

- [x] 4.1 Test OCR Track translation
  - Test with: `f8265449-6cb7-425d-a213-5d2e1af73955`
  - Verify: All 59 text blocks are sent for translation
  - Verify: Translation JSON contains `raw_ocr_translations`

- [x] 4.2 Test OCR Track translated PDF
  - Generate translated reflow PDF
  - Verify: All translated text blocks appear correctly
  - Verify: Text inside images (like EWsenel) is filtered out

- [x] 4.3 Test Direct Track unchanged
  - Verify: Translation still uses element-based approach
  - Verify: No regression in Direct Track flow
@@ -58,36 +58,23 @@ Export settings (format, thresholds, templates) SHALL apply consistently to V2 t

The PDF export SHALL accurately preserve document layout from both OCR and direct extraction tracks with correct coordinate transformation and multi-page support. For Direct Track, a background image rendering approach SHALL be used for visual fidelity.

#### Scenario: Export PDF from direct extraction track
- **WHEN** exporting PDF from a direct-extraction processed document
- **THEN** the system SHALL render source PDF pages as full-page background images at 2x resolution
- **AND** overlay invisible text elements using PDF Text Rendering Mode 3
- **AND** text SHALL remain selectable and searchable despite being invisible
- **AND** visual output SHALL match source document exactly
#### Scenario: OCR Track reflow PDF uses raw OCR regions
- **WHEN** generating reflow PDF for an OCR Track document
- **THEN** the system SHALL load text content from `raw_ocr_regions.json` files
- **AND** text blocks SHALL be sorted by Y coordinate for reading order
- **AND** all text content SHALL match the Layout PDF output
- **AND** images and charts SHALL be embedded from element `saved_path`

#### Scenario: Export PDF from OCR track with full structure
- **WHEN** exporting PDF from OCR-processed document
- **THEN** the PDF SHALL use all 23 PP-StructureV3 element types
- **AND** render tables with proper cell boundaries
- **AND** maintain reading order from parsing_res_list
#### Scenario: Direct Track reflow PDF uses structured content
- **WHEN** generating reflow PDF for a Direct Track document
- **THEN** the system SHALL use `content.cells` for table rendering
- **AND** text elements SHALL use `content` string directly
- **AND** images and charts SHALL be embedded from element `saved_path`

#### Scenario: Handle coordinate transformations correctly
- **WHEN** generating PDF from UnifiedDocument
- **THEN** system SHALL use explicit page dimensions from OCR results (not inferred from bounding boxes)
- **AND** correctly transform Y-axis coordinates from top-left (OCR) to bottom-left (PDF/ReportLab) origin
- **AND** prevent vertical flipping or position misalignment errors
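
The Y-axis flip in this scenario reduces to one subtraction. The sketch below assumes OCR coordinates and the PDF page use the same units (no scaling) and that `page_height` is the explicit page height from the OCR results; `to_pdf_y` is a hypothetical name.

```python
def to_pdf_y(y_top: float, box_height: float, page_height: float) -> float:
    """Convert a top-left-origin box to the bottom-left-origin Y of its lower edge.

    y_top is the distance from the top of the page to the top of the box.
    The result is the distance from the bottom of the page to the bottom
    of the box, which is where a bottom-left-origin canvas anchors it.
    """
    return page_height - (y_top + box_height)
```

Omitting `box_height` from the subtraction places each box one box-height too high, which is a common source of the misalignment this scenario guards against.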

#### Scenario: Direct Track PDF file size increase
- **WHEN** generating Layout PDF for Direct Track documents
- **THEN** the system SHALL accept increased file size due to embedded page images
- **AND** approximately 1-2 MB per page at 2x resolution is expected
- **AND** this trade-off is accepted for improved visual fidelity

#### Scenario: Chart elements excluded from text layer
- **WHEN** generating Layout PDF containing charts
- **THEN** the system SHALL NOT include chart-internal text in the invisible text layer
- **AND** chart visuals SHALL be preserved in the background image
- **AND** chart text SHALL NOT be available for text selection or translation
#### Scenario: Reflow PDF content consistency
- **WHEN** comparing Layout PDF and Reflow PDF for the same document
- **THEN** both PDFs SHALL contain the same text content
- **AND** only the presentation format SHALL differ (positioned vs flowing)

### Requirement: Structure Data Export
The system SHALL provide export formats that preserve document structure for downstream processing.