fix: OCR Track reflow PDF and translation with image text filtering

- Add OCR Track support for reflow PDF generation using raw_ocr_regions.json
- Add OCR Track translation extraction from raw_ocr_regions instead of elements
- Add raw_ocr_translations output format for OCR Track documents
- Add exclusion zone filtering to remove text overlapping with images
- Update API validation to accept both translations and raw_ocr_translations
- Add page_number field to TranslatedItem for proper tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@@ -0,0 +1,51 @@
# Change: Fix OCR Track Reflow PDF

## Why

The OCR Track reflow PDF generation is missing most content because:

1. PP-StructureV3 extracts tables as elements but stores `content: ""` (empty string) instead of structured `content.cells` data
2. The `generate_reflow_pdf` method expects `content.cells` for tables, so tables are skipped
3. Table text exists in `raw_ocr_regions.json` (59 text blocks) but is not used by reflow PDF generation
4. This causes significant content loss - only 6 text elements vs 59 raw OCR regions

The Layout PDF works correctly because it uses `raw_ocr_regions.json` via Simple Text Positioning mode, bypassing the need for structured table data.

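Items 1 and 2 above can be illustrated with a minimal sketch. It is not the actual `generate_reflow_pdf` code. The `type` key and the shape of each cell are assumptions made for illustration, and `table_cells` is a hypothetical helper. Only `content` and `content.cells` come from the description above.

```python
from typing import Any, Optional


def table_cells(element: dict[str, Any]) -> Optional[list]:
    """Return a table element's cells, or None when no structured data exists."""
    content = element.get("content")
    # Direct Track: content is a mapping that carries a non-empty "cells" list.
    if isinstance(content, dict) and content.get("cells"):
        return content["cells"]
    # OCR Track: content is "" (empty string), so there is nothing to render.
    return None


# Hypothetical element shapes, for illustration only.
ocr_track_table = {"type": "table", "content": ""}
direct_track_table = {
    "type": "table",
    "content": {"cells": [{"row": 0, "col": 0, "content": "Total"}]},
}
```

A renderer that only draws tables when such a lookup succeeds will silently drop every OCR Track table, which matches the content loss described above.
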
## What Changes

### Reflow PDF Generation for OCR Track

Modify `generate_reflow_pdf` to use `raw_ocr_regions.json` as the primary text source for OCR Track documents:

1. **Detect processing track** from JSON metadata
2. **For OCR Track**: Load `raw_ocr_regions.json` and render all text blocks in reading order
3. **For Direct Track**: Continue using `content.cells` for tables (already works)
4. **Images/Charts**: Continue using `content.saved_path` from elements (works for both tracks)

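The track detection and source selection in steps 1 to 3 might look roughly like the sketch below. Several details are assumptions, not taken from the codebase: the `metadata.processing_track` key, the top-level `elements` key, `raw_ocr_regions.json` being a JSON list, and the fallback to elements when that file is missing. `detect_track` and `load_text_source` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any


def detect_track(result: dict[str, Any]) -> str:
    """Read the processing track from the result JSON metadata."""
    # Assumption: the track is stored under metadata.processing_track.
    track = result.get("metadata", {}).get("processing_track", "direct")
    return str(track).lower()


def load_text_source(result: dict[str, Any], raw_regions_path: Path) -> tuple[str, list]:
    """Pick the text source for reflow rendering based on the track."""
    if detect_track(result) == "ocr" and raw_regions_path.exists():
        # OCR Track: use the raw OCR regions as the primary text source.
        with raw_regions_path.open(encoding="utf-8") as fh:
            return "raw_ocr_regions", json.load(fh)
    # Direct Track, or OCR Track without a regions file: keep using elements.
    return "elements", result.get("elements", [])
```

Returning the source name alongside the data lets the caller branch once and keeps the Direct Track path untouched.
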
### Data Flow

**OCR Track Reflow PDF (NEW):**
```
raw_ocr_regions.json (59 text blocks)
    + scan_result.json (images/charts only)
    → Sort by Y coordinate (reading order)
    → Render text paragraphs + images
```

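The OCR Track flow above can be sketched as a merge followed by a sort. The region and element field names (`text`, `bbox` as `[x0, y0, x1, y1]`, `page`, `type`) are assumptions about the JSON layout, and `build_reflow_items` is a hypothetical helper. Only `content.saved_path` is named in this proposal.

```python
from typing import Any


def build_reflow_items(regions: list[dict[str, Any]],
                       elements: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge OCR text regions with image/chart elements in reading order."""
    items = []
    for region in regions:
        text = (region.get("text") or "").strip()
        if text:  # skip empty or whitespace-only regions
            items.append({"kind": "text", "value": text,
                          "page": region.get("page", 1), "bbox": region["bbox"]})
    for element in elements:
        content = element.get("content")
        # Only images and charts come from elements; tables are covered by the raw text.
        if element.get("type") in {"image", "chart"} and isinstance(content, dict):
            path = content.get("saved_path")
            if path:
                items.append({"kind": "image", "value": path,
                              "page": element.get("page", 1), "bbox": element["bbox"]})
    # Reading order: page first, then top edge (y0), then left edge (x0).
    items.sort(key=lambda item: (item["page"], item["bbox"][1], item["bbox"][0]))
    return items
```

Sorting by the top edge alone is a simplification: multi-column pages would need column detection before the Y sort, which this sketch does not attempt.
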
**Direct Track Reflow PDF (UNCHANGED):**
```
*_result.json (elements with content.cells)
    → Render tables, text, images in order
```

## Impact

- **Affected file**: `backend/app/services/pdf_generator_service.py`
- **User experience**: OCR Track reflow PDF will contain all text content (matching Layout PDF)
- **Translation**: Reflow translated PDF will also work correctly for OCR Track

## Migration

- No data migration required
- Existing `raw_ocr_regions.json` files contain all necessary data
- No API changes