## Context

The PDF generator currently uses layout preservation mode for all PDF output, placing text at its original coordinates. This works for document reconstruction, but it:

1. Fails for translated content, where text length can differ significantly from the source
2. May not provide the best reading experience for flowing documents

Two PDF generation modes are needed:

1. **Layout Preservation** (existing): Maintains original coordinates
2. **Reflow Layout** (new): Prioritizes readability with flowing content

## Goals / Non-Goals

**Goals:**

- Translated and non-translated documents can use reflow layout
- Both OCR and Direct tracks supported
- Proper reading order preserved using available data
- Consistent font sizes for readability
- Images and tables embedded inline

**Non-Goals:**

- Perfect visual matching with original document layout
- Complex multi-column reflow (output is a simple single-column flow)
- Font style matching from original document

## Decisions

### Decision 1: Reading Order Strategy

| Track | Reading Order Source | Implementation |
|-------|---------------------|----------------|
| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) |

**OCR Track - reading_order array:**

```json
{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}
```

**Direct Track - implicit order:**

- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order
- Elements are already sorted by the extraction engine
- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection

### Decision 2: Separate API Endpoints

Layout preservation and reflow share the download path and are selected with the `format` query parameter. Translated PDFs have their own endpoint and always use reflow.

```
# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf

# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow

# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
```

### Decision 3: Unified Reflow Generation Method

```python
from pathlib import Path
from typing import Optional

def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,  # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """
```

### Decision 4: Reading Order Application

```python
from typing import List

def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')

    if reading_order:
        # OCR track: follow the explicit index list. Out-of-range indices are
        # skipped, and elements not listed in reading_order are omitted.
        return [elements[idx] for idx in reading_order if 0 <= idx < len(elements)]

    # Direct track: elements are already in reading order
    return elements
```

### Decision 5: Consistent Typography

| Element Type | Font Size | Style |
|--------------|-----------|--------|
| Title/H1 | 18pt | Bold |
| H2 | 16pt | Bold |
| H3 | 14pt | Bold |
| Body text | 12pt | Normal |
| Table cell | 10pt | Normal |
| Caption | 10pt | Italic |

### Decision 6: Table Handling in Reflow

Tables use a Platypus `Table` with auto-width columns:

```python
from xml.sax.saxutils import escape

from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle

def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Use the translated text when one exists for this cell id
                text = translations.get(cell.get('id'), text)
            # Escape &, <, > so Paragraph's markup parser treats them as text
            row_data.append(Paragraph(escape(text), self.styles['TableCell']))
        data.append(row_data)

    table = Table(data)  # no colWidths: column widths are computed automatically
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # TableStyle sets padding per side; there is no combined 'PADDING' command
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table
```

### Decision 7: Image Embedding

```python
from reportlab.platypus import Image, Spacer

def _embed_image_reflow(self, element, max_width=450):
    """Return the element's image, scaled to fit max_width (in points)."""
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale down proportionally to fit the page width
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Missing image: return an empty flowable so the story stays valid
    return Spacer(1, 0)
```

## Risks / Trade-offs

- **Risk**: OCR `reading_order` may not be accurate for complex layouts
  - **Mitigation**: Fall back to spatial sort if `reading_order` is missing
- **Risk**: Direct track multi-column detection is unused
  - **Mitigation**: PyMuPDF `sort=True` is generally reliable
- **Risk**: Loss of visual fidelity compared to the original
  - **Mitigation**: This is acceptable; the layout-preserving PDF is still available

## Migration Plan

No migration needed. This is new functionality, and existing behavior is unchanged.

## Open Questions

None. Design confirmed with user.
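
## Appendix: Reading Order Sketch

The reading-order rule from Decisions 1 and 4 can be tried in isolation. The following is a minimal sketch, not the generator's actual code: `elements_in_reading_order` is a hypothetical standalone version of `_get_elements_in_reading_order`, and the sample pages and their `text` values are invented for illustration.

```python
from typing import Any, Dict, List


def elements_in_reading_order(page_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return page elements in reading order (standalone take on Decision 4)."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if not reading_order:
        # Direct track: list order is assumed to be the reading order already.
        return list(elements)
    # OCR track: follow the explicit index list, skipping out-of-range indices.
    return [elements[i] for i in reading_order if 0 <= i < len(elements)]


# Hypothetical sample pages; field values are made up for this example.
ocr_page = {
    "reading_order": [2, 0, 1, 9],  # 9 is out of range and is skipped
    "elements": [{"text": "body"}, {"text": "caption"}, {"text": "title"}],
}
direct_page = {"elements": [{"text": "first"}, {"text": "second"}]}

ocr_texts = [e["text"] for e in elements_in_reading_order(ocr_page)]
direct_texts = [e["text"] for e in elements_in_reading_order(direct_page)]
print(ocr_texts)
print(direct_texts)
```

Note that any element whose index does not appear in `reading_order` is left out of the OCR result, so a page relying on this rule needs a complete index list if every element should be rendered.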