## Context

The PDF generator currently uses layout preservation mode for all PDF output, placing text at its original coordinates. This works for document reconstruction, but it:
1. Fails for translated content, where text length differs significantly from the source
2. May not provide the best reading experience for flowing documents

Two PDF generation modes are needed:
1. **Layout Preservation** (existing): Maintains original coordinates
2. **Reflow Layout** (new): Prioritizes readability with flowing content

## Goals / Non-Goals

**Goals:**
- Both translated and non-translated documents can use reflow layout
- Both OCR and Direct tracks are supported
- Proper reading order is preserved using the available data
- Font sizes are consistent for readability
- Images and tables are embedded inline

**Non-Goals:**
- Perfect visual matching with the original document layout
- Complex multi-column reflow (output is a simple single-column flow)
- Font style matching from the original document

## Decisions

### Decision 1: Reading Order Strategy

| Track | Reading Order Source | Implementation |
|-------|---------------------|----------------|
| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF `sort=True`) |

**OCR Track - `reading_order` array:**
```json
{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}
```

**Direct Track - implicit order:**
- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order
- Elements arrive already sorted by the extraction engine
- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection

### Decision 2: Separate API Endpoints

```
# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf

# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow

# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
```

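One way the `format` query parameter could be resolved on the server side is sketched below. The `PdfFormat` enum and the `resolve_pdf_format` helper are hypothetical names used for illustration, not part of the existing router. The sketch assumes that an omitted parameter keeps the existing layout-preserving behavior and that `layout` is accepted as an explicit value.

```python
from enum import Enum
from typing import Optional


class PdfFormat(str, Enum):
    """Hypothetical enum for the two PDF generation modes."""
    LAYOUT = "layout"
    REFLOW = "reflow"


def resolve_pdf_format(raw: Optional[str]) -> PdfFormat:
    """Map a raw ?format= value to a mode, defaulting to layout preservation."""
    if raw is None or raw.strip() == "":
        # No parameter: keep the existing behavior unchanged.
        return PdfFormat.LAYOUT
    try:
        return PdfFormat(raw.strip().lower())
    except ValueError:
        # Unknown values are rejected rather than silently falling back.
        raise ValueError(f"Unsupported PDF format: {raw!r}") from None
```

Rejecting unknown values, rather than falling back to layout mode, is a design choice in this sketch; a router might instead map the error to a 4xx response.
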
### Decision 3: Unified Reflow Generation Method

```python
from pathlib import Path
from typing import Optional

def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,  # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """
```

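The "with or without translation" behavior can be illustrated with a small text-resolution helper. This is a sketch under assumptions: `resolve_element_text` is a hypothetical name, and it assumes translations are available as a dict keyed by element `id`, mirroring the lookup used for table cells in Decision 6.

```python
from typing import Dict, Optional


def resolve_element_text(
    element: dict,
    translations: Optional[Dict[str, str]] = None,
) -> str:
    """Return the translated text for an element, or its original text."""
    original = element.get("text", "")
    if not translations:
        # No translation file supplied: render the source text as-is.
        return original
    # Fall back to the original when this element has no translation.
    return translations.get(element.get("id"), original)
```

With this shape, the same rendering path serves both cases, and elements missing from the translation data still appear in the output in their source language.
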
### Decision 4: Reading Order Application

```python
from typing import List

def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')

    if reading_order:
        # OCR track: use explicit reading order, skipping out-of-range indices
        ordered = []
        for idx in reading_order:
            if 0 <= idx < len(elements):
                ordered.append(elements[idx])
        return ordered
    else:
        # Direct track: elements already in reading order
        return elements
```

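To make the two branches concrete, here is a standalone version of the same logic (written as a plain function, `elements_in_reading_order`, purely for illustration) applied to made-up page data for each track.

```python
from typing import List


def elements_in_reading_order(page_data: dict) -> List[dict]:
    """Standalone equivalent of the method above, for demonstration only."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if not reading_order:
        # Direct track (or missing/empty array): keep list order.
        return elements
    # OCR track: follow the explicit indices, ignoring invalid ones.
    return [elements[i] for i in reading_order if 0 <= i < len(elements)]


# Illustrative data only; real result JSON carries more fields.
ocr_page = {
    "reading_order": [2, 0, 5, 1],  # index 5 is out of range and is skipped
    "elements": [{"text": "B"}, {"text": "C"}, {"text": "A"}],
}
direct_page = {"elements": [{"text": "first"}, {"text": "second"}]}

print([e["text"] for e in elements_in_reading_order(ocr_page)])     # ['A', 'B', 'C']
print([e["text"] for e in elements_in_reading_order(direct_page)])  # ['first', 'second']
```

Note that with this logic any element whose index is absent from `reading_order` is omitted from the output; whether such elements should instead be appended at the end is not specified here.
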
### Decision 5: Consistent Typography

| Element Type | Font Size | Style  |
|--------------|-----------|--------|
| Title/H1     | 18pt      | Bold   |
| H2           | 16pt      | Bold   |
| H3           | 14pt      | Bold   |
| Body text    | 12pt      | Normal |
| Table cell   | 10pt      | Normal |
| Caption      | 10pt      | Italic |

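The typography table could be encoded as a simple lookup, sketched below. The element-type keys (`"title"`, `"h1"`, `"table_cell"`, and so on), the `TextStyle` tuple, and the `style_for` helper are all hypothetical; the actual element type names in the result JSON may differ. Falling back to body text for unknown types is also an assumption.

```python
from typing import NamedTuple


class TextStyle(NamedTuple):
    font_size: int  # in points
    style: str      # "bold", "normal", or "italic"


# Sizes and styles taken from the table above; keys are assumed names.
REFLOW_STYLES = {
    "title": TextStyle(18, "bold"),
    "h1": TextStyle(18, "bold"),
    "h2": TextStyle(16, "bold"),
    "h3": TextStyle(14, "bold"),
    "body": TextStyle(12, "normal"),
    "table_cell": TextStyle(10, "normal"),
    "caption": TextStyle(10, "italic"),
}


def style_for(element_type: str) -> TextStyle:
    """Look up the style for an element type, defaulting to body text."""
    return REFLOW_STYLES.get(element_type.lower(), REFLOW_STYLES["body"])
```
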
### Decision 6: Table Handling in Reflow

Tables use a Platypus `Table` with auto-width columns:

```python
from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle

def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Prefer the translated text; keep the original if none exists
                text = translations.get(cell.get('id'), text)
            row_data.append(Paragraph(text, self.styles['TableCell']))
        data.append(row_data)

    # No colWidths given, so Platypus sizes the columns automatically
    table = Table(data)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # TableStyle padding is set per side; a bare 'PADDING' is not a documented command
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table
```

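Because Platypus `Paragraph` interprets its text as XML-like markup, cell text containing characters such as `&` or `<` may need escaping before it is wrapped in a `Paragraph`. The sketch below uses only the standard library; `prepare_cell_text` is a hypothetical helper, and converting newlines to `<br/>` is an assumption about how multi-line cells should render.

```python
from xml.sax.saxutils import escape


def prepare_cell_text(text: str) -> str:
    """Escape markup-significant characters, then keep line breaks visible."""
    # escape() replaces &, < and > with their entity forms.
    safe = escape(text)
    # Paragraph markup uses <br/> for a line break inside a paragraph.
    return safe.replace("\n", "<br/>")


print(prepare_cell_text("A < B & C"))  # A &lt; B &amp; C
```

Escaping must happen before the `<br/>` substitution; otherwise the inserted tags would themselves be escaped.
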
### Decision 7: Image Embedding

```python
from reportlab.platypus import Image, Spacer

def _embed_image_reflow(self, element, max_width=450):
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale down to fit the page width, preserving aspect ratio
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Image missing or unresolved: emit an empty placeholder
    return Spacer(1, 0)
```

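The scaling step is plain proportional arithmetic, shown here in isolation. `scale_to_fit` is a hypothetical helper that operates on numbers only, so it can be checked without ReportLab.

```python
from typing import Tuple


def scale_to_fit(
    width: float, height: float, max_width: float = 450
) -> Tuple[float, float]:
    """Shrink (width, height) proportionally so width does not exceed max_width."""
    if width <= max_width:
        # Already fits: images are never scaled up.
        return width, height
    ratio = max_width / width
    return max_width, height * ratio


print(scale_to_fit(900, 600))  # (450, 300.0)
print(scale_to_fit(300, 200))  # (300, 200)
```

Only the width is constrained here, so a very tall image could still exceed the available page height; handling that case is outside what this section specifies.
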
## Risks / Trade-offs

- **Risk**: OCR `reading_order` may not be accurate for complex layouts
  - **Mitigation**: Falls back to spatial sort if `reading_order` is missing

- **Risk**: Direct track multi-column detection is unused
  - **Mitigation**: PyMuPDF `sort=True` is generally reliable

- **Risk**: Loss of visual fidelity compared to the original
  - **Mitigation**: This is acceptable; the layout PDF is still available

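The spatial-sort fallback named in the first mitigation could look roughly like the sketch below. It rests on assumptions: `spatial_sort` is a hypothetical helper, each element is assumed to carry a `bbox` of the form `[x0, y0, x1, y1]` with `y` increasing downward, and `line_tolerance` is an illustrative value for grouping elements onto the same visual line.

```python
from typing import List


def spatial_sort(elements: List[dict], line_tolerance: float = 5.0) -> List[dict]:
    """Order elements top-to-bottom, then left-to-right (single column)."""
    def sort_key(element: dict):
        x0, y0 = element["bbox"][0], element["bbox"][1]
        # Bucket y so elements on roughly the same line sort by x.
        return (round(y0 / line_tolerance), x0)

    return sorted(elements, key=sort_key)


# Illustrative data only.
elements = [
    {"id": "b", "bbox": [300, 102, 500, 120]},
    {"id": "c", "bbox": [50, 40, 500, 60]},
    {"id": "a", "bbox": [50, 100, 250, 120]},
]
print([e["id"] for e in spatial_sort(elements)])  # ['c', 'a', 'b']
```

This matches the single-column scope stated in the Non-Goals; it makes no attempt at multi-column detection, and the fixed-size bucketing can misorder elements that straddle a bucket boundary.
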
## Migration Plan

No migration is needed: this is new functionality, and existing behavior is unchanged.

## Open Questions

None. The design has been confirmed with the user.