test

2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
@@ -0,0 +1,167 @@
+## Context
+
+The PDF generator currently uses layout preservation mode for all PDF output, placing text at original coordinates. This works for document reconstruction but:
+1. Fails for translated content where text length differs significantly
+2. May not provide the best reading experience for flowing documents
+
+Two PDF generation modes are needed:
+1. **Layout Preservation** (existing): Maintains original coordinates
+2. **Reflow Layout** (new): Prioritizes readability with flowing content
+
+## Goals / Non-Goals
+
+**Goals:**
+- Translated and non-translated documents can use reflow layout
+- Both OCR and Direct tracks supported
+- Proper reading order preserved using available data
+- Consistent font sizes for readability
+- Images and tables embedded inline
+
+**Non-Goals:**
+- Perfect visual matching with original document layout
+- Complex multi-column reflow (output is a simple single-column flow)
+- Font style matching from original document
+
+## Decisions
+
+### Decision 1: Reading Order Strategy
+
+| Track | Reading Order Source | Implementation |
+|-------|---------------------|----------------|
+| **OCR** | Explicit `reading_order` array in JSON | Use array indices to order elements |
+| **Direct** | Implicit in element list order | Use list iteration order (PyMuPDF sort=True) |
+
+**OCR Track - reading_order array:**
+```json
+{
+  "pages": [{
+    "reading_order": [0, 1, 2, 3, 6, 7, 8, ...],
+    "elements": [...]
+  }]
+}
+```
+
+**Direct Track - implicit order:**
+- PyMuPDF's `get_text("dict", sort=True)` provides spatial reading order
+- Elements already sorted by extraction engine
+- Optional: Enable `_sort_elements_for_reading_order()` for multi-column detection
+
+### Decision 2: Separate API Endpoints
+
+```
+# Layout preservation (existing)
+GET /api/v2/tasks/{task_id}/download/pdf
+
+# Reflow layout (new)
+GET /api/v2/tasks/{task_id}/download/pdf?format=reflow
+
+# Translated PDF (reflow only)
+POST /api/v2/translate/{task_id}/pdf?lang={lang}
+```
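One way to read the endpoint list above is that the `format` query parameter selects the generation mode on the shared download path. The sketch below illustrates that dispatch with the standard library only. The helper name `select_pdf_mode`, the mode names `"layout"` and `"reflow"`, and the choice to reject any other value are assumptions for illustration, not taken from the codebase.

```python
from urllib.parse import parse_qs, urlsplit


def select_pdf_mode(url: str) -> str:
    """Return "reflow" or "layout" for a download URL (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    # A missing `format` parameter keeps the existing layout-preservation mode.
    fmt = query.get("format", ["layout"])[0]
    if fmt not in ("layout", "reflow"):
        # Assumption: unknown values are rejected rather than silently ignored.
        raise ValueError(f"unsupported format: {fmt!r}")
    return fmt
```

Defaulting to `"layout"` mirrors the statement in the Migration Plan that existing behavior stays unchanged.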
+
+### Decision 3: Unified Reflow Generation Method
+
+```python
+from pathlib import Path
+from typing import Optional
+
+def generate_reflow_pdf(
+    self,
+    result_json_path: Path,
+    output_path: Path,
+    translation_json_path: Optional[Path] = None,  # None = no translation
+    source_file_path: Optional[Path] = None,       # For embedded images
+) -> bool:
+    """
+    Generate reflow layout PDF for either OCR or Direct track.
+    Works with or without translation.
+    """
+```
+
+### Decision 4: Reading Order Application
+
+```python
+from typing import List
+
+def _get_elements_in_reading_order(self, page_data: dict) -> List[dict]:
+    """Get elements sorted by reading order."""
+    elements = page_data.get('elements', [])
+    reading_order = page_data.get('reading_order')
+
+    if reading_order:
+        # OCR track: use explicit reading order, skipping out-of-range indices
+        ordered = []
+        for idx in reading_order:
+            if 0 <= idx < len(elements):
+                ordered.append(elements[idx])
+        return ordered
+
+    # Direct track (or missing/empty reading_order): keep list order
+    return elements
+```
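To make the edge cases of this ordering concrete, here is a standalone sketch of the same logic applied to a small, made-up page. The function name `elements_in_reading_order` and the sample data are illustrative only. Two behaviors are worth noticing: out-of-range indices are skipped, and any element whose index does not appear in `reading_order` is left out of the result.

```python
from typing import Any, Dict, List


def elements_in_reading_order(page_data: Dict[str, Any]) -> List[dict]:
    """Standalone version of the ordering logic, for illustration."""
    elements = page_data.get("elements", [])
    reading_order = page_data.get("reading_order")
    if not reading_order:
        # Missing or empty reading_order: keep the list order as-is.
        return list(elements)
    return [elements[i] for i in reading_order if 0 <= i < len(elements)]


# Made-up sample: index 5 is out of range, index 1 is never referenced.
page = {
    "reading_order": [2, 0, 5],
    "elements": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
}
ordered_ids = [e["id"] for e in elements_in_reading_order(page)]
```

Here `ordered_ids` ends up as `["c", "a"]`: element `"b"` is dropped because nothing in `reading_order` points at it.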
+
+### Decision 5: Consistent Typography
+
+| Element Type | Font Size | Style |
+|-------------|-----------|-------|
+| Title/H1    | 18pt      | Bold  |
+| H2          | 16pt      | Bold  |
+| H3          | 14pt      | Bold  |
+| Body text   | 12pt      | Normal|
+| Table cell  | 10pt      | Normal|
+| Caption     | 10pt      | Italic|
+
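The typography table can be expressed as a simple lookup. The sketch below is one possible encoding, not the project's actual style registry: the element-type keys (`"title"`, `"h1"`, `"table_cell"`, and so on) and the fallback to body text for unknown types are assumptions.

```python
from typing import NamedTuple


class TextStyle(NamedTuple):
    size_pt: int
    style: str


# Values copied from the table above; the key names are hypothetical.
REFLOW_STYLES = {
    "title": TextStyle(18, "bold"),
    "h1": TextStyle(18, "bold"),
    "h2": TextStyle(16, "bold"),
    "h3": TextStyle(14, "bold"),
    "body": TextStyle(12, "normal"),
    "table_cell": TextStyle(10, "normal"),
    "caption": TextStyle(10, "italic"),
}


def style_for(element_type: str) -> TextStyle:
    """Look up the reflow style; unknown types fall back to body text (assumption)."""
    return REFLOW_STYLES.get(element_type.lower(), REFLOW_STYLES["body"])
```

Keeping sizes in one table means the reflow output ignores the font sizes of the original document, which matches the non-goal of font style matching.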
+### Decision 6: Table Handling in Reflow
+
+Tables use Platypus Table with auto-width columns:
+
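+```python
+from xml.sax.saxutils import escape
+
+from reportlab.lib import colors
+from reportlab.platypus import Paragraph, Table, TableStyle
+
+def _create_reflow_table(self, table_data, translations=None):
+    data = []
+    for row in table_data['rows']:
+        row_data = []
+        for cell in row['cells']:
+            text = cell.get('text', '')
+            if translations:
+                # Use the translated text when one exists for this cell
+                text = translations.get(cell.get('id'), text)
+            # Paragraph parses inline markup, so escape &, <, > in raw text
+            row_data.append(Paragraph(escape(text), self.styles['TableCell']))
+        data.append(row_data)
+
+    table = Table(data)  # no colWidths given, so widths are computed automatically
+    table.setStyle(TableStyle([
+        ('GRID', (0, 0), (-1, -1), 0.5, colors.black),
+        ('VALIGN', (0, 0), (-1, -1), 'TOP'),
+        # TableStyle has no single 'PADDING' command; set each side explicitly
+        ('LEFTPADDING', (0, 0), (-1, -1), 6),
+        ('RIGHTPADDING', (0, 0), (-1, -1), 6),
+        ('TOPPADDING', (0, 0), (-1, -1), 6),
+        ('BOTTOMPADDING', (0, 0), (-1, -1), 6),
+    ]))
+    return table
+```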
+
+### Decision 7: Image Embedding
+
+```python
+from reportlab.platypus import Image, Spacer
+
+def _embed_image_reflow(self, element, max_width=450):
+    img_path = self._resolve_image_path(element)
+    if img_path and img_path.exists():
+        img = Image(str(img_path))
+        # Scale down to fit the page width, preserving the aspect ratio
+        if img.drawWidth > max_width:
+            ratio = max_width / img.drawWidth
+            img.drawWidth = max_width
+            img.drawHeight *= ratio
+        return img
+    # Missing image: return an empty flowable so the flow is not interrupted
+    return Spacer(1, 0)
+```
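The scaling arithmetic can be checked in isolation, without ReportLab. The helper below is a hypothetical pure-function version of the same step: images wider than `max_width` are shrunk proportionally, and narrower images are left at their original size.

```python
from typing import Tuple


def fit_to_width(width: float, height: float, max_width: float = 450) -> Tuple[float, float]:
    """Return (width, height) scaled down to max_width, keeping the aspect ratio."""
    if width <= max_width:
        # Never upscale: small images keep their original size.
        return width, height
    ratio = max_width / width
    return max_width, height * ratio
```

For example, `fit_to_width(900, 600)` gives `(450, 300.0)`, while `fit_to_width(200, 100)` is returned unchanged.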
+
+## Risks / Trade-offs
+
+- **Risk**: OCR reading_order may not be accurate for complex layouts
+  - **Mitigation**: Fall back to a spatial sort when `reading_order` is missing (Decision 4 as written returns the element list order in that case)
+
+- **Risk**: Multi-column detection (`_sort_elements_for_reading_order()`) is not enabled on the Direct track
+  - **Mitigation**: PyMuPDF sort=True is generally reliable
+
+- **Risk**: Loss of visual fidelity compared to original
+  - **Mitigation**: Acceptable, since the layout-preservation PDF remains available
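The spatial sort named in the first mitigation is not spelled out in this document. A minimal sketch of one possible approach follows. It assumes each element carries a `bbox` of `[x0, y0, x1, y1]` with the origin at the top-left of the page; that field name and layout are assumptions, as are the function name and the `line_tol` parameter. Elements are grouped into horizontal bands of `line_tol` points, then read left to right within a band.

```python
from typing import Any, Dict, List


def spatial_sort(elements: List[Dict[str, Any]], line_tol: float = 10.0) -> List[Dict[str, Any]]:
    """Order elements top-to-bottom, then left-to-right within a band (sketch)."""

    def key(element: Dict[str, Any]):
        x0, y0 = element["bbox"][0], element["bbox"][1]
        # Elements whose tops fall in the same line_tol-high band share a row.
        return (int(y0 // line_tol), x0)

    return sorted(elements, key=key)


# Made-up sample: "a" and "b" sit on the same line, "c" is further down.
sample = [
    {"id": "a", "bbox": [300, 100, 500, 120]},
    {"id": "b", "bbox": [50, 101, 250, 121]},
    {"id": "c", "bbox": [50, 200, 250, 220]},
]
sorted_ids = [e["id"] for e in spatial_sort(sample)]
```

Fixed-height banding is crude: two elements on the same visual line can land in different bands when they straddle a band boundary, and multi-column pages would still be read across columns. That is consistent with the stated non-goal of complex multi-column reflow.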
+
+## Migration Plan
+
+No migration needed - new functionality, existing behavior unchanged.
+
+## Open Questions
+
+None - design confirmed with user.