Context
The PDF generator currently uses layout preservation mode for all PDF output, placing text at original coordinates. This works for document reconstruction but:
- Fails for translated content where text length differs significantly
- May not provide the best reading experience for flowing documents
Two PDF generation modes are needed:
- Layout Preservation (existing): Maintains original coordinates
- Reflow Layout (new): Prioritizes readability with flowing content
Goals / Non-Goals
Goals:
- Translated and non-translated documents can use reflow layout
- Both OCR and Direct tracks supported
- Proper reading order preserved using available data
- Consistent font sizes for readability
- Images and tables embedded inline
Non-Goals:
- Perfect visual matching with original document layout
- Complex multi-column reflow (output is a simple single-column flow)
- Font style matching from original document
Decisions
Decision 1: Reading Order Strategy
| Track | Reading Order Source | Implementation |
|---|---|---|
| OCR | Explicit reading_order array in JSON | Use array indices to order elements |
| Direct | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) |
OCR Track - reading_order array:
{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}
Direct Track - implicit order:
- PyMuPDF's get_text("dict", sort=True) provides spatial reading order
- Elements already sorted by extraction engine
- Optional: Enable _sort_elements_for_reading_order() for multi-column detection
Decision 2: Separate API Endpoints
# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf
# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow
# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
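A minimal client-side sketch of how these three URLs could be built. The base URL, the helper names, and the sample task ID are assumptions made for illustration; only the paths and query parameters come from the endpoints listed above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the design only specifies the paths.
BASE_URL = "http://localhost:8000"


def pdf_download_url(task_id: str, reflow: bool = False) -> str:
    """Build the GET URL for a task's PDF (layout preservation by default)."""
    url = f"{BASE_URL}/api/v2/tasks/{task_id}/download/pdf"
    if reflow:
        # Reflow is selected with the format query parameter.
        url += "?" + urlencode({"format": "reflow"})
    return url


def translated_pdf_url(task_id: str, lang: str) -> str:
    """Build the POST URL for a translated PDF."""
    query = urlencode({"lang": lang})
    return f"{BASE_URL}/api/v2/translate/{task_id}/pdf?{query}"


print(pdf_download_url("abc123"))
print(pdf_download_url("abc123", reflow=True))
print(translated_pdf_url("abc123", "en"))
```

Omitting the format parameter keeps the existing layout-preservation behavior, so existing clients need no change.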
Decision 3: Unified Reflow Generation Method
from pathlib import Path
from typing import Optional

def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,  # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """
Decision 4: Reading Order Application
from typing import List

def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')
    if reading_order:
        # OCR track: use explicit reading order, skipping out-of-range indices
        ordered = []
        for idx in reading_order:
            if 0 <= idx < len(elements):
                ordered.append(elements[idx])
        return ordered
    else:
        # Direct track: elements already in reading order
        return elements
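To illustrate the behavior on both tracks, here is a standalone sketch of the same logic applied to made-up page data. The function name and the sample elements are hypothetical and exist only for this example.

```python
from typing import List


def elements_in_reading_order(page_data: dict) -> List[dict]:
    """Standalone version of the ordering logic, for illustration only."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if reading_order:
        # Keep only indices that point at an existing element.
        return [elements[i] for i in reading_order if 0 <= i < len(elements)]
    return elements


# OCR-style page: explicit order, with one out-of-range index (5).
ocr_page = {
    "reading_order": [2, 0, 5, 1],
    "elements": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
}
# Direct-style page: no reading_order key, list order is used as-is.
direct_page = {"elements": [{"id": "x"}, {"id": "y"}]}

print([e["id"] for e in elements_in_reading_order(ocr_page)])     # ['c', 'a', 'b']
print([e["id"] for e in elements_in_reading_order(direct_page)])  # ['x', 'y']
```

Note that with an explicit reading_order, any element whose index does not appear in the array is left out of the result.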
Decision 5: Consistent Typography
| Element Type | Font Size | Style |
|---|---|---|
| Title/H1 | 18pt | Bold |
| H2 | 16pt | Bold |
| H3 | 14pt | Bold |
| Body text | 12pt | Normal |
| Table cell | 10pt | Normal |
| Caption | 10pt | Italic |
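One way to encode this table is a small lookup keyed by element type. This is a sketch only: the key names, the TextStyle class, and the fallback to body text for unknown types are assumptions, not part of the design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextStyle:
    size_pt: int
    style: str  # "bold", "normal", or "italic"


# Keys are hypothetical element-type names; sizes and styles follow the table.
REFLOW_STYLES = {
    "title": TextStyle(18, "bold"),
    "h1": TextStyle(18, "bold"),
    "h2": TextStyle(16, "bold"),
    "h3": TextStyle(14, "bold"),
    "body": TextStyle(12, "normal"),
    "table_cell": TextStyle(10, "normal"),
    "caption": TextStyle(10, "italic"),
}


def style_for(element_type: str) -> TextStyle:
    """Look up the style for an element type; unknown types use body text (assumption)."""
    return REFLOW_STYLES.get(element_type.lower(), REFLOW_STYLES["body"])


print(style_for("H2"))
print(style_for("footnote"))
```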
Decision 6: Table Handling in Reflow
Tables use Platypus Table with auto-width columns:
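from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle

def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Prefer the translated text; fall back to the original
                text = translations.get(cell.get('id'), text)
            row_data.append(Paragraph(text, self.styles['TableCell']))
        data.append(row_data)
    table = Table(data)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # ReportLab documents per-side padding commands rather than a single 'PADDING'
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table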
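The translation lookup can be exercised without ReportLab. The sketch below builds only the cell strings, using a hypothetical helper name and made-up sample data. It also escapes the text, on the assumption that cell content is plain text: Paragraph interprets inline XML-like markup, so a raw "&" or "<" could otherwise be misread.

```python
from typing import Dict, List, Optional
from xml.sax.saxutils import escape


def build_cell_texts(
    table_data: dict,
    translations: Optional[Dict[str, str]] = None,
) -> List[List[str]]:
    """Return one list of cell strings per row, translated where available."""
    rows = []
    for row in table_data.get("rows", []):
        texts = []
        for cell in row.get("cells", []):
            text = cell.get("text", "")
            if translations:
                # Fall back to the original text when no translation exists.
                text = translations.get(cell.get("id"), text)
            # Escape &, < and > so the text is safe to pass to Paragraph.
            texts.append(escape(text))
        rows.append(texts)
    return rows


table = {"rows": [{"cells": [{"id": "c1", "text": "Name"},
                             {"id": "c2", "text": "R&D"}]}]}
print(build_cell_texts(table, {"c1": "Nombre"}))  # [['Nombre', 'R&amp;D']]
```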
Decision 7: Image Embedding
from reportlab.platypus import Image, Spacer

def _embed_image_reflow(self, element, max_width=450):
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale down proportionally to fit the page width
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Missing image: return an empty placeholder so the flow is unaffected
    return Spacer(1, 0)
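The scaling rule is easier to check as a pure function. The helper below is a hypothetical standalone version of the same arithmetic; it is not part of the generator.

```python
from typing import Tuple


def fit_to_width(width: float, height: float, max_width: float = 450) -> Tuple[float, float]:
    """Scale (width, height) down proportionally if wider than max_width."""
    if width <= max_width:
        # Already fits: never upscale.
        return width, height
    ratio = max_width / width
    return max_width, height * ratio


print(fit_to_width(900, 600))  # (450, 300.0)
print(fit_to_width(300, 200))  # (300, 200)
```

A 900 x 600 image is twice the maximum width, so both dimensions are halved and the aspect ratio is preserved. Images narrower than the limit are left at their original size.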
Risks / Trade-offs
- Risk: OCR reading_order may not be accurate for complex layouts
  - Mitigation: Falls back to spatial sort if reading_order missing
- Risk: Direct track multi-column detection unused
  - Mitigation: PyMuPDF sort=True is generally reliable
- Risk: Loss of visual fidelity compared to original
  - Mitigation: This is acceptable; layout PDF still available
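The spatial sort mentioned in the first mitigation could look roughly like the sketch below. It assumes each element carries a bbox of the form [x0, y0, x1, y1] with the origin at the top-left. Both the key name and the coordinate convention are assumptions, and a plain top-to-bottom, left-to-right sort does not handle multi-column pages.

```python
from typing import List


def spatial_sort(elements: List[dict]) -> List[dict]:
    """Order elements top-to-bottom, then left-to-right, by bbox corner."""
    def sort_key(element: dict):
        # Hypothetical bbox layout: [x0, y0, x1, y1], top-left origin.
        x0, y0 = element["bbox"][0], element["bbox"][1]
        return (y0, x0)

    return sorted(elements, key=sort_key)


sample = [
    {"id": "footer", "bbox": [50, 700, 500, 720]},
    {"id": "right", "bbox": [300, 100, 500, 150]},
    {"id": "left", "bbox": [50, 100, 250, 150]},
]
print([e["id"] for e in spatial_sort(sample)])  # ['left', 'right', 'footer']
```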
Migration Plan
No migration needed - new functionality, existing behavior unchanged.
Open Questions
None - design confirmed with user.