## Context

The PDF generator currently uses layout preservation mode for all PDF output, placing text at its original coordinates. This works for document reconstruction, but it has two drawbacks:

1. It fails for translated content, where text length can differ significantly from the source
2. It may not provide the best reading experience for flowing documents

Two PDF generation modes are needed:

1. **Layout Preservation** (existing): Maintains original coordinates
2. **Reflow Layout** (new): Prioritizes readability with flowing content

## Goals / Non-Goals

**Goals:**
- Translated and non-translated documents can use reflow layout
- Both OCR and Direct tracks supported
- Proper reading order preserved using available data
- Consistent font sizes for readability
- Images and tables embedded inline

**Non-Goals:**
- Perfect visual matching with original document layout
- Complex multi-column reflow (simple single-column flow)
- Font style matching from original document

## Decisions

### Decision 1: Reading Order Strategy

| Track | Reading Order Source | Implementation |
|-------|----------------------|----------------|
| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF `sort=True`) |

**OCR Track - `reading_order` array:**
```json
{
  "pages": [{
    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
    "elements": [...]
  }]
}
```

**Direct Track - implicit order:**
- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order
- Elements already sorted by extraction engine
- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection

### Decision 2: Separate API Endpoints

```
# Layout preservation (existing)
GET /api/v2/tasks/{task_id}/download/pdf

# Reflow layout (new)
GET /api/v2/tasks/{task_id}/download/pdf?format=reflow

# Translated PDF (reflow only)
POST /api/v2/translate/{task_id}/pdf?lang={lang}
```

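As a rough illustration of how the optional `format` query parameter could select between the two modes, the sketch below maps its value to a mode. The names `PdfMode` and `resolve_pdf_mode`, the acceptance of an explicit `layout` value, and the choice to reject unknown values are all assumptions made for illustration; the design only specifies the URLs listed above.

```python
from enum import Enum
from typing import Optional


class PdfMode(Enum):
    """Hypothetical mode names; only "reflow" appears in the endpoints above."""
    LAYOUT = "layout"
    REFLOW = "reflow"


def resolve_pdf_mode(format_param: Optional[str]) -> PdfMode:
    """Map the optional ?format= value to a generation mode.

    A missing or empty parameter keeps the existing layout-preservation
    behavior, so current clients are unaffected.
    """
    if not format_param:
        return PdfMode.LAYOUT
    try:
        return PdfMode(format_param.lower())
    except ValueError:
        # Assumed policy: unknown values are an error rather than a silent fallback.
        raise ValueError(f"unsupported PDF format: {format_param!r}") from None
```

Defaulting to layout preservation when the parameter is absent matches the stated intent that existing behavior stays unchanged.
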
### Decision 3: Unified Reflow Generation Method

```python
from pathlib import Path
from typing import Optional


def generate_reflow_pdf(
    self,
    result_json_path: Path,
    output_path: Path,
    translation_json_path: Optional[Path] = None,  # None = no translation
    source_file_path: Optional[Path] = None,  # For embedded images
) -> bool:
    """
    Generate reflow layout PDF for either OCR or Direct track.
    Works with or without translation.
    """
```

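One way the "`None` = no translation" convention could be handled is to resolve the optional path into a lookup table once, so the rest of the generator never branches on whether a translation exists. The helper `load_translations` and the assumed file shape (a top-level `translations` object mapping element IDs to text) are hypothetical; the actual translation JSON format is not described here.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def load_translations(translation_json_path: Optional[Path]) -> Dict[str, str]:
    """Return an element-id -> translated-text mapping.

    Returns an empty dict when no translation file is given, so callers can
    always write translations.get(element_id, original_text).
    """
    if translation_json_path is None:
        return {}
    with translation_json_path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed file shape: {"translations": {"<element id>": "<text>", ...}}
    return dict(data.get("translations", {}))
```

With an empty mapping, every lookup falls back to the original text, which gives the non-translated reflow path for free.
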
### Decision 4: Reading Order Application

```python
from typing import List


def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
    """Get elements sorted by reading order."""
    elements = page_data.get('elements', [])
    reading_order = page_data.get('reading_order')

    if reading_order:
        # OCR track: use explicit reading order, skipping out-of-range indices
        ordered = []
        for idx in reading_order:
            if 0 <= idx < len(elements):
                ordered.append(elements[idx])
        return ordered
    else:
        # Direct track: elements already in reading order
        return elements
```

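To make the two branches concrete, here is a standalone version of the same logic run against small sample pages. The free function `elements_in_reading_order` and the sample data are illustrative only; element dictionaries in the real JSON carry more fields than `text`.

```python
from typing import List


def elements_in_reading_order(page_data: dict) -> List[dict]:
    """Standalone copy of the ordering logic, for demonstration."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if not reading_order:
        # Direct track (or missing/empty reading_order): keep list order
        return elements
    # OCR track: follow the index array, ignoring indices outside the list
    return [elements[i] for i in reading_order if 0 <= i < len(elements)]


# OCR-style page: index 7 does not exist and is skipped
ocr_page = {
    "reading_order": [2, 0, 7, 1],
    "elements": [{"text": "B"}, {"text": "C"}, {"text": "A"}],
}
# Direct-style page: no reading_order key at all
direct_page = {"elements": [{"text": "first"}, {"text": "second"}]}

print([e["text"] for e in elements_in_reading_order(ocr_page)])     # ['A', 'B', 'C']
print([e["text"] for e in elements_in_reading_order(direct_page)])  # ['first', 'second']
```

Note that any element whose index is absent from `reading_order` is left out of the result, so an incomplete array drops content rather than appending it at the end.
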
### Decision 5: Consistent Typography

| Element Type | Font Size | Style  |
|--------------|-----------|--------|
| Title/H1     | 18pt      | Bold   |
| H2           | 16pt      | Bold   |
| H3           | 14pt      | Bold   |
| Body text    | 12pt      | Normal |
| Table cell   | 10pt      | Normal |
| Caption      | 10pt      | Italic |

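The table above can be expressed as a small lookup, sketched below. The element-type keys, the `TextStyle` class, the Helvetica font family (one of the standard PDF fonts), the 1.2x leading, and the fallback to body text are assumptions for illustration; the design fixes only the sizes and the bold/normal/italic styles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextStyle:
    font_name: str
    font_size: int  # points

    @property
    def leading(self) -> float:
        # Assumed line spacing of 1.2x the font size; not specified in the design
        return round(self.font_size * 1.2, 1)


# Sizes and styles follow the typography table; font names are assumed
REFLOW_STYLES = {
    "title": TextStyle("Helvetica-Bold", 18),
    "h1": TextStyle("Helvetica-Bold", 18),
    "h2": TextStyle("Helvetica-Bold", 16),
    "h3": TextStyle("Helvetica-Bold", 14),
    "body": TextStyle("Helvetica", 12),
    "table_cell": TextStyle("Helvetica", 10),
    "caption": TextStyle("Helvetica-Oblique", 10),
}


def style_for(element_type: str) -> TextStyle:
    """Look up the style for an element type, defaulting to body text."""
    return REFLOW_STYLES.get(element_type, REFLOW_STYLES["body"])
```

Because the sizes come from this fixed mapping rather than from the source document, headings and body text stay consistent regardless of what the original fonts were.
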
### Decision 6: Table Handling in Reflow

Tables use ReportLab's Platypus `Table` with automatically sized columns:

```python
from reportlab.lib import colors
from reportlab.platypus import Paragraph, Table, TableStyle


def _create_reflow_table(self, table_data, translations=None):
    data = []
    for row in table_data['rows']:
        row_data = []
        for cell in row['cells']:
            text = cell.get('text', '')
            if translations:
                # Prefer the translated text; fall back to the original
                text = translations.get(cell.get('id'), text)
            row_data.append(Paragraph(text, self.styles['TableCell']))
        data.append(row_data)

    # No colWidths given, so ReportLab sizes the columns automatically
    table = Table(data)
    table.setStyle(TableStyle([
        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
        # Per-side padding commands; a bare 'PADDING' is not a documented TableStyle command
        ('LEFTPADDING', (0, 0), (-1, -1), 6),
        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
        ('TOPPADDING', (0, 0), (-1, -1), 6),
        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
    ]))
    return table
```

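Because ReportLab's `Paragraph` interprets its text as XML-like markup, literal `&` or `<` characters in OCR or translated text can break parsing. The sketch below separates text resolution from flowable creation and escapes each cell first. The helper name `resolve_cell_texts` is hypothetical, and whether the existing pipeline already escapes text elsewhere is not known from this document.

```python
from typing import Dict, List, Optional
from xml.sax.saxutils import escape


def resolve_cell_texts(
    table_data: dict,
    translations: Optional[Dict[str, str]] = None,
) -> List[List[str]]:
    """Return a row-major matrix of markup-safe cell strings.

    Each cell uses its translation when one exists for the cell id,
    otherwise its original text.
    """
    translations = translations or {}
    rows = []
    for row in table_data.get("rows", []):
        rows.append([
            escape(translations.get(cell.get("id"), cell.get("text", "")))
            for cell in row.get("cells", [])
        ])
    return rows


table = {"rows": [{"cells": [{"id": "c1", "text": "R&D"}, {"id": "c2", "text": "Total"}]}]}
print(resolve_cell_texts(table, {"c2": "合計"}))  # [['R&amp;D', '合計']]
```

Each resulting string can then be wrapped in a `Paragraph` with the table-cell style.
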
### Decision 7: Image Embedding

```python
from reportlab.platypus import Image, Spacer


def _embed_image_reflow(self, element, max_width=450):
    img_path = self._resolve_image_path(element)
    if img_path and img_path.exists():
        img = Image(str(img_path))
        # Scale down to max_width (points), preserving aspect ratio
        if img.drawWidth > max_width:
            ratio = max_width / img.drawWidth
            img.drawWidth = max_width
            img.drawHeight *= ratio
        return img
    # Missing image: return an empty flowable so the layout continues
    return Spacer(1, 0)
```

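The scaling arithmetic can be checked in isolation, without ReportLab. The helper `fit_to_width` below is a hypothetical extraction of the same calculation: images wider than `max_width` are shrunk proportionally, and smaller images are left at their natural size.

```python
from typing import Tuple


def fit_to_width(width: float, height: float, max_width: float = 450) -> Tuple[float, float]:
    """Return (width, height) scaled down to max_width, keeping the aspect ratio."""
    if width <= max_width:
        # Already fits: never upscale
        return width, height
    ratio = max_width / width
    return max_width, height * ratio


print(fit_to_width(900, 600))  # (450, 300.0)
print(fit_to_width(300, 200))  # (300, 200)
```

Only the width is constrained here, as in the method above, so a very tall image can still exceed the page height.
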
## Risks / Trade-offs

- **Risk**: OCR `reading_order` may not be accurate for complex layouts
  - **Mitigation**: Falls back to spatial sort if `reading_order` is missing

- **Risk**: Multi-column detection is not used on the Direct track
  - **Mitigation**: PyMuPDF `sort=True` is generally reliable

- **Risk**: Loss of visual fidelity compared to the original
  - **Mitigation**: This is acceptable; the layout-preserving PDF is still available

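A spatial-sort fallback of the kind mentioned in the first mitigation could look like the sketch below: group elements into rows by their top edge, then order each row left to right. The function name, the `bbox` field laid out as `[x0, y0, x1, y1]` with a top-left origin, and the 5-point row tolerance are all assumptions; this is a single-column heuristic and does not attempt multi-column detection.

```python
from typing import List


def spatial_sort(elements: List[dict], line_tolerance: float = 5.0) -> List[dict]:
    """Order elements top-to-bottom, then left-to-right within each row.

    Assumes each element has bbox = [x0, y0, x1, y1] with y increasing downward.
    """
    by_top = sorted(elements, key=lambda el: el["bbox"][1])

    # Group elements whose top edge is within line_tolerance of the row's first element
    rows: List[List[dict]] = []
    for el in by_top:
        if rows and abs(el["bbox"][1] - rows[-1][0]["bbox"][1]) <= line_tolerance:
            rows[-1].append(el)
        else:
            rows.append([el])

    ordered: List[dict] = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda el: el["bbox"][0]))
    return ordered


sample = [
    {"text": "C", "bbox": [50, 150, 200, 165]},
    {"text": "B", "bbox": [300, 102, 450, 117]},
    {"text": "A", "bbox": [50, 100, 200, 115]},
]
print([el["text"] for el in spatial_sort(sample)])  # ['A', 'B', 'C']
```

The tolerance keeps elements on the same visual line together even when their top edges differ by a few points.
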
## Migration Plan

No migration is needed: this is new functionality, and existing behavior is unchanged.

## Open Questions

None. The design has been confirmed with the user.