test

2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions
--- a/PLAN.md
+++ b/PLAN.md
@@ -0,0 +1,186 @@
 # PDF 處理雙軌制改善計劃 (修訂版 v5)
 ## 問題分析
 ### 一、Direct Track 表格問題
 | 指標 | edit.pdf | edit3.pdf |
 |------|----------|-----------|
 | 原始表格結構 | 6 rows x 2 cols | 12 rows x 17 cols |
 | PyMuPDF 識別的 cells | 12 (無合併) | **83** (有121個合併) |
 | Direct Track 提取的 cells | 12 | **204** (全部視為1x1) |
 | 跨欄/跨行識別 | 不需要 | **❌ 完全未識別** |
 | 渲染結果 | ✓ 完美 | ❌ 欄位切分錯誤、文字超出 |
 **根因**: `_detect_tables_by_position()` 無法識別合併單元格
 ### 二、Direct Track 圖片問題 (edit3.pdf)
 | 問題 | 數量 | 說明 |
 |------|------|------|
 | 極小裝飾圖片 | 3 | < 200 px²，應過濾 |
 | 覆蓋圖像 (黑框) | 6 | 已檢測但未從渲染中移除 |
 | 大型 vector_graphics | 3 | ✓ 已正確過濾 |
 ### 三、OCR Track 表格問題
 | 表格 | cells | cell_boxes | cell_boxes 坐標檢查 |
 |------|-------|------------|-------------------|
 | pp3_0_3 | 13 | 13 | ⚠️ 1/5 超出範圍 |
 | pp3_0_6 | 29 | 12 | ❌ 全部超出範圍 |
 | pp3_0_7 | 12 | 51 | ❌ 全部超出範圍 |
 | pp3_0_16 | 51 | 29 | ❌ 全部超出範圍 |
 **根因**: PP-StructureV3 的 cell_boxes 座標系統錯亂
 ### 四、OCR Track 圖片問題 ❌ 嚴重
 | 文件 | 圖片元素 | PP-Structure 原始數據 | 轉換後 UnifiedDocument | 結果 |
 |------|---------|---------------------|----------------------|------|
 | edit.pdf | pp3_1_8 | saved_path="pp3_1_8.png" ✓ | content=字符串 ❌ | 圖片未放回 |
 | edit3.pdf | pp3_1_2 | saved_path="pp3_1_2.png" ✓ | content=字符串 ❌ | 圖片未放回 |
 **根因**: `ocr_to_unified_converter.py` 的 `_convert_pp3_element` 方法中：
 ```python
 # 當前代碼 (第604-613行)
 elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
    content = {'path': elem_data.get('img_path', ''), ...}
 else:
    content = elem_data.get('content', '')  # ← CHART 類型走這裡！
 ```
 **問題**:
 1. `CHART` 類型未被視為視覺元素
 2. `saved_path` 完全丟失
 3. `content` 變成文字而非圖片路徑
 ---
 ## 改善計劃
 ### 階段 1: Direct Track 使用 PyMuPDF find_tables (優先級：最高)
 **問題**: `_detect_tables_by_position` 無法識別合併單元格
 **方案**: 改用 PyMuPDF 的 `find_tables()` API
 **檔案**: `backend/app/services/direct_extraction_engine.py`
 ```python
 def _extract_tables_with_pymupdf(self, page, page_num, counter):
    tables = page.find_tables()
    for table in tables.tables:
        # 獲取 cells，保留合併信息
        cells = []
        for row_idx in range(table.row_count):
            for col_idx in range(table.col_count):
                cell_data = table.cells[row_idx * table.col_count + col_idx]
                if cell_data is None:
                    continue  # 跳過被合併的單元格
                # 計算 row_span/col_span...
 ```
 ### 階段 2: 修復 OCR Track 圖片路徑丟失 (優先級：最高)
 **問題**: CHART 類型的 saved_path 在轉換時丟失
 **檔案**: `backend/app/services/ocr_to_unified_converter.py`
 **位置**: `_convert_pp3_element` 方法，約第604行
 **修改**:
 ```python
 # 修改前
 elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
 # 修改後：包含所有視覺元素類型
 elif element_type in [
    ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
    ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
 ]:
    # 優先使用 saved_path
    image_path = (
        elem_data.get('saved_path') or
        elem_data.get('img_path') or
        ''
    )
    content = {
        'saved_path': image_path,  # 關鍵：保留 saved_path
        'path': image_path,
        'width': elem_data.get('width', 0),
        'height': elem_data.get('height', 0),
        'format': elem_data.get('format', 'unknown')
    }
 ```
 ### 階段 3: 修復 OCR Track cell_boxes 座標 (優先級：高)
 **方案**: 驗證座標，超出範圍時使用 CV 線檢測 fallback
 ### 階段 4: 過濾極小裝飾圖片 (優先級：高)
 ```python
 if elem_area < 200:
    continue  # 跳過 < 200 px² 的圖片
 ```
 ### 階段 5: 過濾覆蓋圖像 (優先級：高)
 在提取階段過濾與 covering_images 重疊的圖片。
 ---
 ## 實施優先級
 | 階段 | 描述 | 優先級 | 影響 |
 |------|------|--------|------|
 | 1 | Direct Track 使用 PyMuPDF find_tables | **最高** | 修復合併單元格 |
 | 2 | **OCR Track 圖片路徑修復** | **最高** | 修復圖片未放回 |
 | 3 | OCR Track cell_boxes 座標修復 | 高 | 修復表格渲染錯亂 |
 | 4 | 過濾極小裝飾圖片 | 高 | 減少無意義圖片 |
 | 5 | 過濾覆蓋圖像 | 高 | 減少黑框 |
 ---
 ## 預期效果
 ### Direct Track
 | 指標 | 修改前 | 修改後 |
 |------|--------|--------|
 | edit3.pdf cells | 204 (錯誤拆分) | 83 (正確識別合併) |
 | 跨欄/跨行識別 | ❌ | ✓ |
 ### OCR Track 圖片
 | 指標 | 修改前 | 修改後 |
 |------|--------|--------|
 | pp3_1_8 (edit.pdf) | 圖片未放回 | ✓ 正確放回 |
 | pp3_1_2 (edit3.pdf) | 圖片未放回 | ✓ 正確放回 |
 ### OCR Track 表格
 | 指標 | 修改前 | 修改後 |
 |------|--------|--------|
 | cell_boxes 座標 | 3/5 表格錯誤 | 全部正確或 CV fallback |
 ---
 ## 測試計劃
 1. **edit.pdf Direct Track**: 確保無回歸
 2. **edit3.pdf Direct Track**:
   - 驗證表格識別到 83 cells（非 204）
   - 驗證跨欄/跨行正確
   - 驗證極小圖片被過濾
   - 驗證黑框被過濾
 3. **edit.pdf OCR Track**:
   - **驗證 pp3_1_8.png 正確放回**
   - 驗證 cell_boxes 座標修復
 4. **edit3.pdf OCR Track**:
   - **驗證 pp3_1_2.png 正確放回**
   - 驗證 cell_boxes 座標修復
--- a/backend/app/services/direct_extraction_engine.py
+++ b/backend/app/services/direct_extraction_engine.py
--- a/backend/app/services/ocr_to_unified_converter.py
+++ b/backend/app/services/ocr_to_unified_converter.py
@@ -178,6 +178,114 @@ def trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]:
    return result
 def validate_cell_boxes(
    cell_boxes: List[List[float]],
    table_bbox: List[float],
    page_width: float,
    page_height: float,
    tolerance: float = 5.0
 ) -> Dict[str, Any]:
    """
    Validate cell_boxes coordinates against page boundaries and table bbox.
    PP-StructureV3 sometimes returns cell_boxes with coordinates that exceed
    page boundaries. This function validates and reports issues.
    Args:
        cell_boxes: List of cell bounding boxes [[x0, y0, x1, y1], ...]
        table_bbox: Table bounding box [x0, y0, x1, y1]
        page_width: Page width in pixels
        page_height: Page height in pixels
        tolerance: Allowed tolerance for boundary checks (pixels)
    Returns:
        Dict with:
            - valid: bool - whether all cell_boxes are valid
            - invalid_count: int - number of invalid cell_boxes
            - clamped_boxes: List - cell_boxes clamped to valid boundaries
            - issues: List[str] - description of issues found
    """
    if not cell_boxes:
        return {'valid': True, 'invalid_count': 0, 'clamped_boxes': [], 'issues': []}
    issues = []
    invalid_count = 0
    clamped_boxes = []
    # Page boundaries with tolerance
    min_x = -tolerance
    min_y = -tolerance
    max_x = page_width + tolerance
    max_y = page_height + tolerance
    for idx, box in enumerate(cell_boxes):
        if not box or len(box) < 4:
            issues.append(f"Cell {idx}: Invalid box format")
            invalid_count += 1
            clamped_boxes.append([0, 0, 0, 0])
            continue
        x0, y0, x1, y1 = box[:4]
        is_valid = True
        cell_issues = []
        # Check if coordinates exceed page boundaries
        if x0 < min_x:
            cell_issues.append(f"x0={x0:.1f} < 0")
            is_valid = False
        if y0 < min_y:
            cell_issues.append(f"y0={y0:.1f} < 0")
            is_valid = False
        if x1 > max_x:
            cell_issues.append(f"x1={x1:.1f} > page_width={page_width:.1f}")
            is_valid = False
        if y1 > max_y:
            cell_issues.append(f"y1={y1:.1f} > page_height={page_height:.1f}")
            is_valid = False
        # Check for inverted coordinates
        if x0 > x1:
            cell_issues.append(f"x0={x0:.1f} > x1={x1:.1f}")
            is_valid = False
        if y0 > y1:
            cell_issues.append(f"y0={y0:.1f} > y1={y1:.1f}")
            is_valid = False
        if not is_valid:
            invalid_count += 1
            issues.append(f"Cell {idx}: {', '.join(cell_issues)}")
        # Clamp to valid boundaries
        clamped_box = [
            max(0, min(x0, page_width)),
            max(0, min(y0, page_height)),
            max(0, min(x1, page_width)),
            max(0, min(y1, page_height))
        ]
        # Ensure proper ordering after clamping
        if clamped_box[0] > clamped_box[2]:
            clamped_box[0], clamped_box[2] = clamped_box[2], clamped_box[0]
        if clamped_box[1] > clamped_box[3]:
            clamped_box[1], clamped_box[3] = clamped_box[3], clamped_box[1]
        clamped_boxes.append(clamped_box)
    if invalid_count > 0:
        logger.warning(
            f"Cell boxes validation: {invalid_count}/{len(cell_boxes)} invalid. "
            f"Page: {page_width:.0f}x{page_height:.0f}, Table bbox: {table_bbox}"
        )
    return {
        'valid': invalid_count == 0,
        'invalid_count': invalid_count,
        'clamped_boxes': clamped_boxes,
        'issues': issues,
        'needs_fallback': invalid_count > len(cell_boxes) * 0.5  # >50% invalid = needs fallback
    }
 class OCRToUnifiedConverter:
    """
    Converter for transforming PP-StructureV3 OCR results to UnifiedDocument format.
@@ -337,19 +445,22 @@ class OCRToUnifiedConverter:
        for page_idx, page_result in enumerate(enhanced_results):
            elements = []
            # Get page dimensions first (needed for element conversion)
            page_width = page_result.get('width', 0)
            page_height = page_result.get('height', 0)
            pp_dimensions = Dimensions(width=page_width, height=page_height)
            # Process elements from parsing_res_list
            if 'elements' in page_result:
                for elem_data in page_result['elements']:
-                    element = self._convert_pp3_element(elem_data, page_idx)
+                    element = self._convert_pp3_element(
                        elem_data, page_idx,
                        page_width=page_width,
                        page_height=page_height
                    )
                    if element:
                        elements.append(element)
            # Get page dimensions
            pp_dimensions = Dimensions(
                width=page_result.get('width', 0),
                height=page_result.get('height', 0)
            )
            # Apply gap filling if enabled and raw regions available
            if self.gap_filling_service and raw_text_regions:
                # Filter raw regions for current page
@@ -556,9 +667,19 @@ class OCRToUnifiedConverter:
    def _convert_pp3_element(
        self,
        elem_data: Dict[str, Any],
-        page_idx: int
+        page_idx: int,
        page_width: float = 0,
        page_height: float = 0
    ) -> Optional[DocumentElement]:
-        """Convert PP-StructureV3 element to DocumentElement."""
+        """
        Convert PP-StructureV3 element to DocumentElement.
        Args:
            elem_data: Element data from PP-StructureV3
            page_idx: Page index (0-based)
            page_width: Page width for coordinate validation
            page_height: Page height for coordinate validation
        """
        try:
            # Extract bbox
            bbox_data = elem_data.get('bbox', [0, 0, 0, 0])
@@ -597,18 +718,67 @@ class OCRToUnifiedConverter:
                # Preserve cell_boxes and embedded_images in metadata for PDF generation
                # These are extracted by PP-StructureV3 and provide accurate cell positioning
                if 'cell_boxes' in elem_data:
-                    elem_data.setdefault('metadata', {})['cell_boxes'] = elem_data['cell_boxes']
+                    cell_boxes = elem_data['cell_boxes']
-                    elem_data['metadata']['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
+                    elem_data.setdefault('metadata', {})['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
                    # Validate cell_boxes coordinates if page dimensions are available
                    if page_width > 0 and page_height > 0:
                        validation = validate_cell_boxes(
                            cell_boxes=cell_boxes,
                            table_bbox=bbox_data,
                            page_width=page_width,
                            page_height=page_height
                        )
                        if not validation['valid']:
                            elem_data['metadata']['cell_boxes_validation'] = {
                                'valid': False,
                                'invalid_count': validation['invalid_count'],
                                'total_count': len(cell_boxes),
                                'needs_fallback': validation['needs_fallback']
                            }
                            # Use clamped boxes instead of invalid ones
                            elem_data['metadata']['cell_boxes'] = validation['clamped_boxes']
                            elem_data['metadata']['cell_boxes_original'] = cell_boxes
                            if validation['needs_fallback']:
                                logger.warning(
                                    f"Table {elem_data.get('element_id')}: "
                                    f"{validation['invalid_count']}/{len(cell_boxes)} cell_boxes invalid, "
                                    f"fallback recommended"
                                )
                        else:
                            elem_data['metadata']['cell_boxes'] = cell_boxes
                            elem_data['metadata']['cell_boxes_validation'] = {'valid': True}
                    else:
                        # No page dimensions available, store as-is
                        elem_data['metadata']['cell_boxes'] = cell_boxes
                if 'embedded_images' in elem_data:
                    elem_data.setdefault('metadata', {})['embedded_images'] = elem_data['embedded_images']
-            elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
+            elif element_type in [
-                # For images, use metadata dict as content
+                ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
                ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
            ]:
                # For all visual elements, use metadata dict as content
                # Priority: saved_path > img_path (PP-StructureV3 uses saved_path)
                image_path = (
                    elem_data.get('saved_path') or
                    elem_data.get('img_path') or
                    ''
                )
                content = {
-                    'path': elem_data.get('img_path', ''),
+                    'saved_path': image_path,  # Preserve original path key
                    'path': image_path,        # For backward compatibility
                    'width': elem_data.get('width', 0),
                    'height': elem_data.get('height', 0),
                    'format': elem_data.get('format', 'unknown')
                }
                if not image_path:
                    logger.warning(
                        f"Visual element {element_type.value} missing image path: "
                        f"saved_path={elem_data.get('saved_path')}, img_path={elem_data.get('img_path')}"
                    )
            else:
                content = elem_data.get('content', '')
@@ -1139,10 +1309,18 @@ class OCRToUnifiedConverter:
        for page_idx, page_data in enumerate(pages_data):
            elements = []
            # Get page dimensions first
            page_width = page_data.get('width', 0)
            page_height = page_data.get('height', 0)
            # Process each element in the page
            if 'elements' in page_data:
                for elem_data in page_data['elements']:
-                    element = self._convert_pp3_element(elem_data, page_idx)
+                    element = self._convert_pp3_element(
                        elem_data, page_idx,
                        page_width=page_width,
                        page_height=page_height
                    )
                    if element:
                        elements.append(element)
@@ -1150,8 +1328,8 @@ class OCRToUnifiedConverter:
            page = Page(
                page_number=page_idx + 1,
                dimensions=Dimensions(
-                    width=page_data.get('width', 0),
+                    width=page_width,
-                    height=page_data.get('height', 0)
+                    height=page_height
                ),
                elements=elements,
                metadata={'reading_order': self._calculate_reading_order(elements)}
--- a/backend/app/services/pdf_generator_service.py
+++ b/backend/app/services/pdf_generator_service.py
@@ -3371,18 +3371,21 @@ class PDFGeneratorService:
            "rows": 6,
            "cols": 2,
            "cells": [
-                {"row": 0, "col": 0, "content": "..."},
+                {"row": 0, "col": 0, "content": "...", "row_span": 1, "col_span": 2},
                {"row": 0, "col": 1, "content": "..."},
                ...
            ]
        }
-        Returns format compatible with HTMLTableParser output:
+        Returns format compatible with HTMLTableParser output (with colspan/rowspan/col):
        [
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 0
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 1
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
            ...
        ]
        Note: This returns actual cells per row with their absolute column positions.
        The table renderer uses 'col' to place cells correctly in the grid.
        """
        try:
            num_rows = content.get('rows', 0)
@@ -3392,21 +3395,39 @@ class PDFGeneratorService:
            if not cells or num_rows == 0 or num_cols == 0:
                return []
-            # Initialize rows structure
+            # Group cells by row
-            rows_data = []
+            cells_by_row = {}
            for _ in range(num_rows):
                rows_data.append({'cells': [{'text': ''} for _ in range(num_cols)]})
            # Fill in cell content
            for cell in cells:
                row_idx = cell.get('row', 0)
-                col_idx = cell.get('col', 0)
+                if row_idx not in cells_by_row:
                    cells_by_row[row_idx] = []
                cells_by_row[row_idx].append(cell)
            # Sort cells within each row by column
            for row_idx in cells_by_row:
                cells_by_row[row_idx].sort(key=lambda c: c.get('col', 0))
            # Build rows structure with colspan/rowspan info and absolute col position
            rows_data = []
            for row_idx in range(num_rows):
                row_cells = []
                if row_idx in cells_by_row:
                    for cell in cells_by_row[row_idx]:
                        cell_content = cell.get('content', '')
                        row_span = cell.get('row_span', 1) or 1
                        col_span = cell.get('col_span', 1) or 1
                        col_idx = cell.get('col', 0)
-                if 0 <= row_idx < num_rows and 0 <= col_idx < num_cols:
+                        row_cells.append({
-                    rows_data[row_idx]['cells'][col_idx]['text'] = str(cell_content) if cell_content else ''
+                            'text': str(cell_content) if cell_content else '',
                            'rowspan': row_span,
                            'colspan': col_span,
                            'col': col_idx  # Absolute column position
                        })
-            logger.debug(f"Built {num_rows} rows from cells dict")
+                rows_data.append({'cells': row_cells})
            logger.debug(f"Built {num_rows} rows from cells dict with span info")
            return rows_data
        except Exception as e:
@@ -3471,19 +3492,115 @@ class PDFGeneratorService:
            table_width = bbox.x1 - bbox.x0
            table_height = bbox.y1 - bbox.y0
            # Build table data for ReportLab
            table_content = []
            for row in rows:
                row_data = [cell['text'].strip() for cell in row['cells']]
                table_content.append(row_data)
            # Create table
            from reportlab.platypus import Table, TableStyle
            from reportlab.lib import colors
-            # Determine number of rows and columns for cell_boxes calculation
+            # Determine grid size from rows structure
            # Note: rows may have 'col' attribute for absolute positioning (from Direct extraction)
            # or may be sequential (from HTML parsing)
            num_rows = len(rows)
-            max_cols = max(len(row['cells']) for row in rows) if rows else 0
+
            # Check if cells have absolute column positions
            has_absolute_cols = any(
                'col' in cell
                for row in rows
                for cell in row['cells']
            )
            # Calculate actual number of columns
            max_cols = 0
            if has_absolute_cols:
                # Use absolute col positions + colspan to find max column
                for row in rows:
                    for cell in row['cells']:
                        col = cell.get('col', 0)
                        colspan = cell.get('colspan', 1)
                        max_cols = max(max_cols, col + colspan)
            else:
                # Sequential cells: sum up colspans
                for row in rows:
                    col_pos = 0
                    for cell in row['cells']:
                        colspan = cell.get('colspan', 1)
                        col_pos += colspan
                    max_cols = max(max_cols, col_pos)
            # Build table data for ReportLab with proper grid structure
            # ReportLab needs a full grid with placeholders for spanned cells
            # and SPAN commands to merge them
            table_content = []
            span_commands = []
            covered = set()  # Track cells covered by spans
            # First pass: mark covered cells and collect SPAN commands
            for row_idx, row in enumerate(rows):
                if has_absolute_cols:
                    # Use absolute column positions
                    for cell in row['cells']:
                        col_pos = cell.get('col', 0)
                        colspan = cell.get('colspan', 1)
                        rowspan = cell.get('rowspan', 1)
                        # Mark cells covered by this span
                        if colspan > 1 or rowspan > 1:
                            for r in range(row_idx, row_idx + rowspan):
                                for c in range(col_pos, col_pos + colspan):
                                    if (r, c) != (row_idx, col_pos):
                                        covered.add((r, c))
                            # Add SPAN command for ReportLab
                            span_commands.append((
                                'SPAN',
                                (col_pos, row_idx),
                                (col_pos + colspan - 1, row_idx + rowspan - 1)
                            ))
                else:
                    # Sequential positioning
                    col_pos = 0
                    for cell in row['cells']:
                        while (row_idx, col_pos) in covered:
                            col_pos += 1
                        colspan = cell.get('colspan', 1)
                        rowspan = cell.get('rowspan', 1)
                        if colspan > 1 or rowspan > 1:
                            for r in range(row_idx, row_idx + rowspan):
                                for c in range(col_pos, col_pos + colspan):
                                    if (r, c) != (row_idx, col_pos):
                                        covered.add((r, c))
                            span_commands.append((
                                'SPAN',
                                (col_pos, row_idx),
                                (col_pos + colspan - 1, row_idx + rowspan - 1)
                            ))
                        col_pos += colspan
            # Second pass: build content grid
            for row_idx in range(num_rows):
                row_data = [''] * max_cols
                if row_idx < len(rows):
                    if has_absolute_cols:
                        # Place cells at their absolute positions
                        for cell in rows[row_idx]['cells']:
                            col_pos = cell.get('col', 0)
                            if col_pos < max_cols:
                                row_data[col_pos] = cell['text'].strip()
                    else:
                        # Sequential placement
                        col_pos = 0
                        for cell in rows[row_idx]['cells']:
                            while col_pos < max_cols and (row_idx, col_pos) in covered:
                                col_pos += 1
                            if col_pos < max_cols:
                                row_data[col_pos] = cell['text'].strip()
                                colspan = cell.get('colspan', 1)
                                col_pos += colspan
                table_content.append(row_data)
            logger.debug(f"Built table grid: {num_rows} rows × {max_cols} cols, {len(span_commands)} span commands (absolute_cols={has_absolute_cols})")
            # Use original column widths from extraction if available
            # Otherwise try to compute from cell_boxes (from PP-StructureV3)
@@ -3517,7 +3634,7 @@ class PDFGeneratorService:
            # Apply style with minimal padding to reduce table extension
            # Use Chinese font to support special characters (℃, μm, ≦, ×, Ω, etc.)
            font_for_table = self.font_name if self.font_registered else 'Helvetica'
-            style = TableStyle([
+            style_commands = [
                ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
                ('FONTNAME', (0, 0), (-1, -1), font_for_table),
                ('FONTSIZE', (0, 0), (-1, -1), 8),
@@ -3529,7 +3646,13 @@ class PDFGeneratorService:
                ('BOTTOMPADDING', (0, 0), (-1, -1), 0),
                ('LEFTPADDING', (0, 0), (-1, -1), 1),
                ('RIGHTPADDING', (0, 0), (-1, -1), 1),
-            ])
+            ]
            # Add span commands for merged cells
            style_commands.extend(span_commands)
            if span_commands:
                logger.info(f"Applied {len(span_commands)} SPAN commands for merged cells")
            style = TableStyle(style_commands)
            t.setStyle(style)
            # Use canvas scaling as fallback to fit table within bbox
@@ -4350,30 +4473,97 @@ class PDFGeneratorService:
                        # Replace newlines with <br/>
                        safe_content = safe_content.replace('\n', '<br/>')
-                        # Calculate font size from bbox height, but keep minimum 10pt
+                        # Get original font size from style info
-                        font_size = max(box_height * 0.7, 10)
+                        style_info = elem.get('style', {})
-                        font_size = min(font_size, 24)  # Cap at 24pt
+                        original_font_size = style_info.get('font_size', 12.0)
-                        # Create style for this element
+                        # Detect vertical text (Y-axis labels, etc.)
-                        elem_style = ParagraphStyle(
+                        # Vertical text has aspect_ratio (height/width) > 2 and multiple characters
-                            f'elem_{id(elem)}',
+                        is_vertical_text = (
-                            parent=base_style,
+                            box_height > box_width * 2 and
-                            fontSize=font_size,
+                            len(content.strip()) > 1
-                            leading=font_size * 1.2,
+                        )
                        if is_vertical_text:
                            # For vertical text, use original font size and rotate
                            font_size = min(original_font_size, box_width * 0.9)
                            font_size = max(font_size, 6)  # Minimum 6pt
                            # Save canvas state for rotation
                            pdf_canvas.saveState()
                            # Convert to PDF coordinates
                            pdf_y_center = current_page_height - (y0 + y1) / 2
                            x_center = (x0 + x1) / 2
                            # Translate to center, rotate, translate back
                            pdf_canvas.translate(x_center, pdf_y_center)
                            pdf_canvas.rotate(90)
                            # Set font and draw text centered
                            pdf_canvas.setFont(
                                self.font_name if self.font_registered else 'Helvetica',
                                font_size
                            )
                            # Draw text at origin (since we translated to center)
                            text_width = pdf_canvas.stringWidth(
                                safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'),
                                self.font_name if self.font_registered else 'Helvetica',
                                font_size
                            )
                            pdf_canvas.drawString(-text_width / 2, -font_size / 3,
                                safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'))
                            pdf_canvas.restoreState()
                        else:
                            # For horizontal text, dynamically fit text within bbox
                            # Start with original font size and reduce until text fits
                            MIN_FONT_SIZE = 6
                            MAX_FONT_SIZE = 14
                            if original_font_size > 0:
                                start_font_size = min(original_font_size, MAX_FONT_SIZE)
                            else:
                                start_font_size = min(box_height * 0.7, MAX_FONT_SIZE)
                            font_size = max(start_font_size, MIN_FONT_SIZE)
                            # Try progressively smaller font sizes until text fits
                            para = None
                            para_height = box_height + 1  # Start with height > box to enter loop
                            while font_size >= MIN_FONT_SIZE and para_height > box_height:
                                elem_style = ParagraphStyle(
                                    f'elem_{id(elem)}_{font_size}',
                                    parent=base_style,
                                    fontSize=font_size,
                                    leading=font_size * 1.15,  # Tighter leading
                                )
                        # Create paragraph
                                para = Paragraph(safe_content, elem_style)
                                para_width, para_height = para.wrap(box_width, box_height * 3)
-                        # Calculate available width and height
+                                if para_height <= box_height:
-                        available_width = box_width
+                                    break  # Text fits!
                        available_height = box_height * 2  # Allow overflow
-                        # Wrap the paragraph
+                                font_size -= 0.5  # Reduce font size and try again
-                        para_width, para_height = para.wrap(available_width, available_height)
+
                            # Ensure minimum font size
                            if font_size < MIN_FONT_SIZE:
                                font_size = MIN_FONT_SIZE
                                elem_style = ParagraphStyle(
                                    f'elem_{id(elem)}_min',
                                    parent=base_style,
                                    fontSize=font_size,
                                    leading=font_size * 1.15,
                                )
                                para = Paragraph(safe_content, elem_style)
                                para_width, para_height = para.wrap(box_width, box_height * 3)
                            # Convert to PDF coordinates (y from bottom)
-                        pdf_y = current_page_height - y0 - para_height
+                            # Clip to bbox height to prevent overflow
                            actual_height = min(para_height, box_height)
                            pdf_y = current_page_height - y0 - actual_height
                            # Draw the paragraph
                            para.drawOn(pdf_canvas, x0, pdf_y)
@@ -4451,13 +4641,47 @@ class PDFGeneratorService:
            pdf_y_bottom = page_height - ty1
            pdf_canvas.rect(tx0, pdf_y_bottom, table_width, table_height, stroke=1, fill=0)
-            # Step 2: Draw cell borders using cell_boxes
+            # Step 2: Get or calculate cell boxes
            cell_boxes = metadata.get('cell_boxes', [])
-            if cell_boxes:
+
            # If no cell_boxes, calculate from column_widths and row_heights
            if not cell_boxes:
                column_widths = metadata.get('column_widths', [])
                row_heights = metadata.get('row_heights', [])
                if column_widths and row_heights:
                    # Calculate cell positions from widths and heights
                    cell_boxes = []
                    rows = content.get('rows', len(row_heights)) if isinstance(content, dict) else len(row_heights)
                    cols = content.get('cols', len(column_widths)) if isinstance(content, dict) else len(column_widths)
                    # Calculate cumulative positions
                    x_positions = [tx0]
                    for w in column_widths[:cols]:
                        x_positions.append(x_positions[-1] + w)
                    y_positions = [ty0]
                    for h in row_heights[:rows]:
                        y_positions.append(y_positions[-1] + h)
                    # Create cell boxes for each cell (row-major order)
                    for row_idx in range(rows):
                        for col_idx in range(cols):
                            if col_idx < len(x_positions) - 1 and row_idx < len(y_positions) - 1:
                                cx0 = x_positions[col_idx]
                                cy0 = y_positions[row_idx]
                                cx1 = x_positions[col_idx + 1]
                                cy1 = y_positions[row_idx + 1]
                                cell_boxes.append([cx0, cy0, cx1, cy1])
                    logger.debug(f"Calculated {len(cell_boxes)} cell boxes from {cols} cols x {rows} rows")
            # Normalize cell boxes for grid alignment
-                if hasattr(self, '_normalize_cell_boxes_to_grid'):
+            if cell_boxes and hasattr(self, '_normalize_cell_boxes_to_grid'):
                cell_boxes = self._normalize_cell_boxes_to_grid(cell_boxes)
            # Draw cell borders
            if cell_boxes:
                pdf_canvas.setLineWidth(0.5)
                for box in cell_boxes:
                    if len(box) >= 4:
--- a/backend/app/services/pp_structure_enhanced.py
+++ b/backend/app/services/pp_structure_enhanced.py
@@ -558,8 +558,8 @@ class PPStructureEnhanced:
                        element['embedded_images'] = embedded_images
                        logger.info(f"[TABLE] Embedded {len(embedded_images)} images into table")
-            # Special handling for images/figures/stamps (visual elements that need cropping)
+            # Special handling for images/figures/charts/stamps (visual elements that need cropping)
-            elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.STAMP, ElementType.LOGO]:
+            elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART, ElementType.DIAGRAM, ElementType.STAMP, ElementType.LOGO]:
                # Save image if path provided
                if 'img_path' in item and output_dir:
                    saved_path = self._save_image(item['img_path'], output_dir, element['element_id'])
--- a/backend/tests/debug_table_cells.py
+++ b/backend/tests/debug_table_cells.py
@@ -0,0 +1,43 @@
 """Debug PyMuPDF table.cells structure"""
 import sys
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 import fitz
 pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
 doc = fitz.open(str(pdf_path))
 page = doc[0]
 tables = page.find_tables()
 for idx, table in enumerate(tables.tables):
    data = table.extract()
    num_rows = len(data)
    num_cols = max(len(row) for row in data) if data else 0
    print(f"Table {idx}:")
    print(f"  table.extract() dimensions: {num_rows} rows x {num_cols} cols")
    print(f"  Expected positions: {num_rows * num_cols}")
    cell_rects = getattr(table, 'cells', None)
    if cell_rects:
        print(f"  table.cells length: {len(cell_rects)}")
        none_count = sum(1 for c in cell_rects if c is None)
        actual_count = sum(1 for c in cell_rects if c is not None)
        print(f"  None cells: {none_count}")
        print(f"  Actual cells: {actual_count}")
        # Check if cell_rects matches grid size
        if len(cell_rects) != num_rows * num_cols:
            print(f"  WARNING: cell_rects length ({len(cell_rects)}) != grid size ({num_rows * num_cols})")
        # Show first few cells
        print(f"  First 5 cells: {cell_rects[:5]}")
    else:
        print(f"  table.cells: NOT AVAILABLE")
    # Check row_count and col_count
    print(f"  table.row_count: {getattr(table, 'row_count', 'N/A')}")
    print(f"  table.col_count: {getattr(table, 'col_count', 'N/A')}")
 doc.close()
--- a/backend/tests/debug_table_cells2.py
+++ b/backend/tests/debug_table_cells2.py
@@ -0,0 +1,48 @@
 """Debug PyMuPDF table structure - find merge info"""
 import sys
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 import fitz
 pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
 doc = fitz.open(str(pdf_path))
 page = doc[0]
 tables = page.find_tables()
 for idx, table in enumerate(tables.tables):
    print(f"\nTable {idx}:")
    # Check all available attributes
    print(f"  Available attributes: {[a for a in dir(table) if not a.startswith('_')]}")
    # Try to get header info
    if hasattr(table, 'header'):
        print(f"  header: {table.header}")
    # Check for cells info
    cell_rects = table.cells
    print(f"  cells count: {len(cell_rects)}")
    # Get the extracted data
    data = table.extract()
    print(f"  extract() shape: {len(data)} x {max(len(r) for r in data)}")
    # Check if there's a way to map cells to grid positions
    # Look at the pandas output which might have merge info
    try:
        df = table.to_pandas()
        print(f"  pandas shape: {df.shape}")
    except Exception as e:
        print(f"  pandas error: {e}")
    # Check the TableRow objects if available
    if hasattr(table, 'rows'):
        rows = table.rows
        print(f"  rows: {len(rows)}")
        for ri, row in enumerate(rows[:3]):  # first 3 rows
            print(f"    row {ri}: {len(row.cells)} cells")
            for ci, cell in enumerate(row.cells[:5]):  # first 5 cells
                print(f"      cell {ci}: bbox={cell}")
 doc.close()
--- a/backend/tests/generate_test_pdf.py
+++ b/backend/tests/generate_test_pdf.py
@@ -0,0 +1,111 @@
 """
 Generate test PDF to verify Phase 1 fixes
 """
 import sys
 import os
 from pathlib import Path
 # Add backend to path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 from app.services.direct_extraction_engine import DirectExtractionEngine
 from app.services.pdf_generator_service import PDFGeneratorService
 from app.services.unified_document_exporter import UnifiedDocumentExporter
 def generate_test_pdf(input_pdf: str, output_dir: Path):
    """Generate test PDF using Direct Track extraction"""
    input_path = Path(input_pdf)
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Processing: {input_path.name}")
    print(f"Output dir: {output_dir}")
    # Step 1: Extract with Direct Track
    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,  # Filter tiny images
        enable_whiteout_detection=True,
        enable_content_sanitization=True
    )
    unified_doc = engine.extract(input_path, output_dir=output_dir)
    # Print extraction stats
    print(f"\n=== Extraction Results ===")
    print(f"Document ID: {unified_doc.document_id}")
    print(f"Pages: {len(unified_doc.pages)}")
    table_count = 0
    image_count = 0
    merged_cells = 0
    total_cells = 0
    for page in unified_doc.pages:
        for elem in page.elements:
            if elem.type.value == 'table':
                table_count += 1
                if elem.content and hasattr(elem.content, 'cells'):
                    total_cells += len(elem.content.cells)
                    for cell in elem.content.cells:
                        if cell.row_span > 1 or cell.col_span > 1:
                            merged_cells += 1
            elif elem.type.value == 'image':
                image_count += 1
    print(f"Tables: {table_count}")
    print(f"  - Total cells: {total_cells}")
    print(f"  - Merged cells: {merged_cells}")
    print(f"Images: {image_count}")
    # Step 2: Export to JSON
    exporter = UnifiedDocumentExporter()
    json_path = output_dir / f"{input_path.stem}_result.json"
    exporter.export_to_json(unified_doc, json_path)
    print(f"\nJSON saved: {json_path}")
    # Step 3: Generate layout PDF
    pdf_generator = PDFGeneratorService()
    pdf_path = output_dir / f"{input_path.stem}_layout.pdf"
    try:
        pdf_generator.generate_from_unified_document(
            unified_doc=unified_doc,
            output_path=pdf_path,
            source_file_path=input_path
        )
        print(f"PDF saved: {pdf_path}")
        return pdf_path
    except Exception as e:
        print(f"PDF generation error: {e}")
        import traceback
        traceback.print_exc()
        return None
 if __name__ == "__main__":
    # Test with edit3.pdf (has complex tables with merging)
    demo_docs = Path(__file__).parent.parent.parent / "demo_docs"
    output_base = Path(__file__).parent.parent / "storage" / "test_phase1"
    # Process edit3.pdf
    edit3_pdf = demo_docs / "edit3.pdf"
    if edit3_pdf.exists():
        output_dir = output_base / "edit3"
        result = generate_test_pdf(str(edit3_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")
    # Also process edit.pdf for comparison
    edit_pdf = demo_docs / "edit.pdf"
    if edit_pdf.exists():
        output_dir = output_base / "edit"
        result = generate_test_pdf(str(edit_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")
    print(f"\n=== Output Location ===")
    print(f"{output_base}")
--- a/backend/tests/test_phase1_fixes.py
+++ b/backend/tests/test_phase1_fixes.py
@@ -0,0 +1,285 @@
 """
 Phase 1 Bug Fixes Verification Tests
 Tests for:
 1.1 Direct Track table cell merging
 1.2 OCR Track image path preservation
 1.3 Cell boxes coordinate validation
 1.4 Tiny decoration image filtering
 1.5 Covering image removal
 """
 import sys
 import os
 from pathlib import Path
 # Add backend to path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 import fitz
 from app.services.direct_extraction_engine import DirectExtractionEngine
 from app.services.ocr_to_unified_converter import validate_cell_boxes
 from app.models.unified_document import TableCell
 def test_1_1_table_cell_merging():
    """Test 1.1.5: Verify edit3.pdf returns correct merged cells"""
    print("\n" + "="*60)
    print("TEST 1.1: Direct Track Table Cell Merging")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return False
    doc = fitz.open(str(pdf_path))
    total_cells = 0
    merged_cells = 0
    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table_idx, table in enumerate(tables.tables):
            data = table.extract()
            cell_rects = getattr(table, 'cells', None)
            if cell_rects:
                num_rows = len(data)
                num_cols = max(len(row) for row in data) if data else 0
                # Count actual cells (non-None)
                actual_cells = sum(1 for c in cell_rects if c is not None)
                none_cells = sum(1 for c in cell_rects if c is None)
                print(f"  Page {page_num}, Table {table_idx}:")
                print(f"    Grid size: {num_rows} x {num_cols} = {num_rows * num_cols} positions")
                print(f"    Actual cells: {actual_cells}")
                print(f"    Merged positions (None): {none_cells}")
                total_cells += actual_cells
                if none_cells > 0:
                    merged_cells += 1
    doc.close()
    print(f"\n  Total actual cells across all tables: {total_cells}")
    print(f"  Tables with merging: {merged_cells}")
    # According to PLAN.md, edit3.pdf should have 83 cells (not 204)
    # The presence of None values indicates merging is detected
    if total_cells > 0 and total_cells < 204:
        print("  RESULT: PASS - Cell merging detected correctly")
        return True
    elif total_cells == 204:
        print("  RESULT: FAIL - All cells treated as 1x1 (no merging detected)")
        return False
    else:
        print(f"  RESULT: INCONCLUSIVE - {total_cells} cells found")
        return None
 def test_1_3_cell_boxes_validation():
    """Test 1.3: Verify cell_boxes coordinate validation"""
    print("\n" + "="*60)
    print("TEST 1.3: Cell Boxes Coordinate Validation")
    print("="*60)
    # Test case 1: Valid coordinates
    valid_boxes = [
        [10, 10, 100, 50],
        [100, 10, 200, 50],
        [10, 50, 200, 100]
    ]
    result = validate_cell_boxes(valid_boxes, [0, 0, 300, 200], 300, 200)
    print(f"  Valid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert result['valid'], "Valid boxes should pass validation"
    # Test case 2: Out of bounds coordinates
    invalid_boxes = [
        [-10, 10, 100, 50],    # x0 < 0
        [10, 10, 400, 50],     # x1 > page_width
        [10, 10, 100, 300]     # y1 > page_height
    ]
    result = validate_cell_boxes(invalid_boxes, [0, 0, 300, 200], 300, 200)
    print(f"  Invalid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert not result['valid'], "Invalid boxes should fail validation"
    assert result['invalid_count'] == 3, "Should detect 3 invalid boxes"
    # Test case 3: Clamping
    assert len(result['clamped_boxes']) == 3, "Should return clamped boxes"
    clamped = result['clamped_boxes'][0]
    assert clamped[0] >= 0, "Clamped x0 should be >= 0"
    print("  RESULT: PASS - Coordinate validation works correctly")
    return True
 def test_1_4_tiny_image_filtering():
    """Test 1.4: Verify tiny decoration image filtering"""
    print("\n" + "="*60)
    print("TEST 1.4: Tiny Decoration Image Filtering")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    doc = fitz.open(str(pdf_path))
    tiny_count = 0
    normal_count = 0
    min_area = 200  # Same threshold as in DirectExtractionEngine
    for page_num, page in enumerate(doc):
        images = page.get_images()
        for img in images:
            xref = img[0]
            rects = page.get_image_rects(xref)
            if rects:
                rect = rects[0]
                area = (rect.x1 - rect.x0) * (rect.y1 - rect.y0)
                if area < min_area:
                    tiny_count += 1
                    print(f"  Page {page_num}: Tiny image xref={xref}, area={area:.1f} px²")
                else:
                    normal_count += 1
    doc.close()
    print(f"\n  Tiny images (< {min_area} px²): {tiny_count}")
    print(f"  Normal images: {normal_count}")
    if tiny_count > 0:
        print("  RESULT: PASS - Tiny images detected, will be filtered")
        return True
    else:
        print("  RESULT: INFO - No tiny images found in test file")
        return None
 def test_1_5_covering_image_detection():
    """Test 1.5: Verify covering image detection"""
    print("\n" + "="*60)
    print("TEST 1.5: Covering Image Detection")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    engine = DirectExtractionEngine(
        enable_whiteout_detection=True,
        whiteout_iou_threshold=0.8
    )
    doc = fitz.open(str(pdf_path))
    total_covering = 0
    for page_num, page in enumerate(doc):
        result = engine._preprocess_page(page, page_num, doc)
        covering_images = result.get('covering_images', [])
        if covering_images:
            print(f"  Page {page_num}: {len(covering_images)} covering images detected")
            for img in covering_images[:3]:  # Show first 3
                print(f"    - xref={img.get('xref')}, type={img.get('color_type')}, "
                      f"bbox={[round(x, 1) for x in img.get('bbox', [])]}")
            total_covering += len(covering_images)
    doc.close()
    print(f"\n  Total covering images detected: {total_covering}")
    if total_covering > 0:
        print("  RESULT: PASS - Covering images detected, will be filtered")
        return True
    else:
        print("  RESULT: INFO - No covering images found in test file")
        return None
 def test_direct_extraction_full():
    """Full integration test for Direct Track extraction"""
    print("\n" + "="*60)
    print("INTEGRATION TEST: Direct Track Full Extraction")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,
        enable_whiteout_detection=True
    )
    try:
        result = engine.extract(pdf_path)  # Pass Path object, not string
        # Count elements
        table_count = 0
        image_count = 0
        merged_table_count = 0
        for page in result.pages:
            for elem in page.elements:
                if elem.type.value == 'table':
                    table_count += 1
                    if elem.content and hasattr(elem.content, 'cells'):
                        # Check for merged cells
                        for cell in elem.content.cells:
                            if cell.row_span > 1 or cell.col_span > 1:
                                merged_table_count += 1
                                break
                elif elem.type.value == 'image':
                    image_count += 1
        print(f"  Document ID: {result.document_id}")
        print(f"  Pages: {len(result.pages)}")
        print(f"  Tables: {table_count} (with merging: {merged_table_count})")
        print(f"  Images: {image_count}")
        print("  RESULT: PASS - Extraction completed successfully")
        return True
    except Exception as e:
        print(f"  RESULT: FAIL - {e}")
        import traceback
        traceback.print_exc()
        return False
 if __name__ == "__main__":
    print("="*60)
    print("Phase 1 Bug Fixes Verification Tests")
    print("="*60)
    results = {}
    # Run tests
    results['1.1_table_merging'] = test_1_1_table_cell_merging()
    results['1.3_coord_validation'] = test_1_3_cell_boxes_validation()
    results['1.4_tiny_filtering'] = test_1_4_tiny_image_filtering()
    results['1.5_covering_detection'] = test_1_5_covering_image_detection()
    results['integration'] = test_direct_extraction_full()
    # Summary
    print("\n" + "="*60)
    print("TEST SUMMARY")
    print("="*60)
    for test_name, result in results.items():
        status = "PASS" if result is True else "FAIL" if result is False else "SKIP/INFO"
        print(f"  {test_name}: {status}")
    passed = sum(1 for r in results.values() if r is True)
    failed = sum(1 for r in results.values() if r is False)
    skipped = sum(1 for r in results.values() if r is None)
    print(f"\n  Total: {passed} passed, {failed} failed, {skipped} skipped/info")
--- a/frontend/src/components/ProcessingTrackSelector.tsx
+++ b/frontend/src/components/ProcessingTrackSelector.tsx
@@ -0,0 +1,148 @@
 import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card'
 import { Badge } from '@/components/ui/badge'
 import { Cpu, FileText, Sparkles, Info } from 'lucide-react'
 import type { ProcessingTrack, DocumentAnalysisResponse } from '@/types/apiV2'
 interface ProcessingTrackSelectorProps {
  value: ProcessingTrack | null  // null means "use system recommendation"
  onChange: (track: ProcessingTrack | null) => void
  documentAnalysis?: DocumentAnalysisResponse | null
  disabled?: boolean
 }
 export default function ProcessingTrackSelector({
  value,
  onChange,
  documentAnalysis,
  disabled = false,
 }: ProcessingTrackSelectorProps) {
  const recommendedTrack = documentAnalysis?.recommended_track
  const tracks = [
    {
      id: null as ProcessingTrack | null,
      name: '自動選擇',
      description: '根據文件類型自動選擇最佳處理方式',
      icon: Sparkles,
      color: 'text-purple-600',
      bgColor: 'bg-purple-50',
      borderColor: 'border-purple-200',
      recommended: false,
    },
    {
      id: 'direct' as ProcessingTrack,
      name: '直接提取 (DIRECT)',
      description: '從 PDF 中直接提取文字圖層，適用於可編輯 PDF',
      icon: FileText,
      color: 'text-blue-600',
      bgColor: 'bg-blue-50',
      borderColor: 'border-blue-200',
      recommended: recommendedTrack === 'direct',
    },
    {
      id: 'ocr' as ProcessingTrack,
      name: 'OCR 識別',
      description: '使用光學字元識別處理圖片或掃描文件',
      icon: Cpu,
      color: 'text-green-600',
      bgColor: 'bg-green-50',
      borderColor: 'border-green-200',
      recommended: recommendedTrack === 'ocr',
    },
  ]
  return (
    <Card>
      <CardHeader>
        <div className="flex items-center gap-3">
          <div className="p-2 bg-primary/10 rounded-lg">
            <Sparkles className="w-5 h-5 text-primary" />
          </div>
          <div>
            <CardTitle>處理方式選擇</CardTitle>
            <p className="text-sm text-muted-foreground mt-1">
              選擇文件的處理方式，或讓系統自動判斷
            </p>
          </div>
        </div>
      </CardHeader>
      <CardContent className="space-y-3">
        {/* Info about override */}
        {value !== null && recommendedTrack && value !== recommendedTrack && (
          <div className="flex items-start gap-2 p-3 bg-amber-50 border border-amber-200 rounded-lg">
            <Info className="w-4 h-4 text-amber-600 flex-shrink-0 mt-0.5" />
            <p className="text-sm text-amber-800">
              您已覆蓋系統建議。系統原本建議使用「{recommendedTrack === 'direct' ? '直接提取' : 'OCR 識別'}」方式處理此文件。
            </p>
          </div>
        )}
        {/* Track options */}
        <div className="grid gap-3">
          {tracks.map((track) => {
            const isSelected = value === track.id
            const Icon = track.icon
            return (
              <button
                key={track.id ?? 'auto'}
                type="button"
                disabled={disabled}
                onClick={() => onChange(track.id)}
                className={`
                  w-full p-4 rounded-lg border-2 text-left transition-all
                  ${isSelected
                    ? `${track.borderColor} ${track.bgColor}`
                    : 'border-border hover:border-primary/30 hover:bg-muted/30'
                  }
                  ${disabled ? 'opacity-50 cursor-not-allowed' : 'cursor-pointer'}
                `}
              >
                <div className="flex items-start gap-3">
                  <div className={`p-2 rounded-lg ${isSelected ? track.bgColor : 'bg-muted'}`}>
                    <Icon className={`w-5 h-5 ${isSelected ? track.color : 'text-muted-foreground'}`} />
                  </div>
                  <div className="flex-1 min-w-0">
                    <div className="flex items-center gap-2">
                      <span className={`font-medium ${isSelected ? track.color : ''}`}>
                        {track.name}
                      </span>
                      {track.recommended && (
                        <Badge variant="outline" className="text-xs bg-white">
                          系統建議
                        </Badge>
                      )}
                      {isSelected && (
                        <Badge variant="default" className="text-xs">
                          已選擇
                        </Badge>
                      )}
                    </div>
                    <p className="text-sm text-muted-foreground mt-1">
                      {track.description}
                    </p>
                  </div>
                </div>
              </button>
            )
          })}
        </div>
        {/* Current analysis info */}
        {documentAnalysis && (
          <div className="pt-3 border-t border-border">
            <div className="flex flex-wrap gap-x-4 gap-y-1 text-xs text-muted-foreground">
              <span>文件分析信心度: {(documentAnalysis.confidence * 100).toFixed(0)}%</span>
              {documentAnalysis.page_count && (
                <span>頁數: {documentAnalysis.page_count}</span>
              )}
              {documentAnalysis.text_coverage !== null && (
                <span>文字覆蓋率: {(documentAnalysis.text_coverage * 100).toFixed(1)}%</span>
              )}
            </div>
          </div>
        )}
      </CardContent>
    </Card>
  )
 }
--- a/frontend/src/pages/ProcessingPage.tsx
+++ b/frontend/src/pages/ProcessingPage.tsx
@@ -8,14 +8,15 @@ import { Button } from '@/components/ui/button'
 import { Badge } from '@/components/ui/badge'
 import { useToast } from '@/components/ui/toast'
 import { apiClientV2 } from '@/services/apiV2'
-import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2, Info } from 'lucide-react'
+import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2 } from 'lucide-react'
 import LayoutModelSelector from '@/components/LayoutModelSelector'
 import PreprocessingSettings from '@/components/PreprocessingSettings'
 import PreprocessingPreview from '@/components/PreprocessingPreview'
 import TableDetectionSelector from '@/components/TableDetectionSelector'
 import ProcessingTrackSelector from '@/components/ProcessingTrackSelector'
 import TaskNotFound from '@/components/TaskNotFound'
 import { useTaskValidation } from '@/hooks/useTaskValidation'
-import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, DocumentAnalysisResponse } from '@/types/apiV2'
+import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, ProcessingTrack } from '@/types/apiV2'
 export default function ProcessingPage() {
  const { t } = useTranslation()
@@ -56,6 +57,9 @@ export default function ProcessingPage() {
    enable_region_detection: true,
  })
  // Processing track override state (null = use system recommendation)
  const [forceTrack, setForceTrack] = useState<ProcessingTrack | null>(null)
  // Analyze document to determine if OCR is needed (only for pending tasks)
  const { data: documentAnalysis, isLoading: isAnalyzing } = useQuery({
    queryKey: ['documentAnalysis', taskId],
@@ -65,16 +69,23 @@ export default function ProcessingPage() {
  })
  // Determine if preprocessing options should be shown
-  // Only show for OCR track files (images and non-editable PDFs)
+  // Show OCR options when:
-  const needsOcrTrack = documentAnalysis?.recommended_track === 'ocr' ||
+  // 1. User explicitly selected OCR track
  // 2. OR system recommends OCR/hybrid track (and user hasn't overridden to direct)
  // 3. OR still analyzing (show by default)
  const needsOcrTrack = forceTrack === 'ocr' ||
    (forceTrack === null && (
      documentAnalysis?.recommended_track === 'ocr' ||
      documentAnalysis?.recommended_track === 'hybrid' ||
-    !documentAnalysis // Show by default while analyzing
+      !documentAnalysis
    ))
  // Start OCR processing
  const processOCRMutation = useMutation({
    mutationFn: () => {
      const options: ProcessingOptions = {
-        use_dual_track: true,
+        use_dual_track: forceTrack === null,  // Only use dual-track auto-detection if not forcing
        force_track: forceTrack || undefined,  // Pass force_track if user selected one
        language: 'ch',
        layout_model: layoutModel,
        preprocessing_mode: preprocessingMode,
@@ -392,53 +403,14 @@ export default function ProcessingPage() {
            </div>
          )}
-          {/* Document Analysis Info */}
+          {/* Processing Track Selector - Always show after analysis */}
-          {documentAnalysis && (
+          {!isAnalyzing && (
-            <Card className={documentAnalysis.recommended_track === 'direct' ? 'border-blue-200 bg-blue-50' : 'border-green-200 bg-green-50'}>
+            <ProcessingTrackSelector
-              <CardContent className="pt-4">
+              value={forceTrack}
-                <div className="flex items-start gap-3">
+              onChange={setForceTrack}
-                  <Info className={`w-5 h-5 flex-shrink-0 mt-0.5 ${documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}`} />
+              documentAnalysis={documentAnalysis}
-                  <div className="flex-1">
+              disabled={processOCRMutation.isPending}
-                    {documentAnalysis.recommended_track === 'direct' ? (
+            />
                      <>
                        <p className="text-sm font-medium text-blue-800">此文件為可編輯 PDF</p>
                        <p className="text-sm text-blue-700 mt-1">
                          系統偵測到此 PDF 包含文字圖層，將使用直接文字提取方式處理。
                          版面偵測和影像前處理設定不適用於此類文件。
                        </p>
                      </>
                    ) : (
                      <>
                        <p className="text-sm font-medium text-green-800">
                          {documentAnalysis.is_editable ? '混合文件' : '掃描文件 / 影像'}
                        </p>
                        <p className="text-sm text-green-700 mt-1">
                          {documentAnalysis.reason}
                        </p>
                      </>
                    )}
                    <div className="flex flex-wrap gap-4 mt-2 text-xs">
                      <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
                        處理方式: {documentAnalysis.recommended_track === 'direct' ? '直接提取' : documentAnalysis.recommended_track === 'ocr' ? 'OCR 識別' : '混合處理'}
                      </span>
                      {documentAnalysis.page_count && (
                        <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
                          頁數: {documentAnalysis.page_count}
                        </span>
                      )}
                      {documentAnalysis.text_coverage !== null && (
                        <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
                          文字覆蓋率: {(documentAnalysis.text_coverage * 100).toFixed(1)}%
                        </span>
                      )}
                      <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
                        信心度: {(documentAnalysis.confidence * 100).toFixed(0)}%
                      </span>
                    </div>
                  </div>
                </div>
              </CardContent>
            </Card>
          )}
          {/* OCR Track Options - Only show when document needs OCR */}
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/design.md
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/proposal.md
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/specs/result-export/spec.md
--- a/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md
+++ b/openspec/changes/archive/2025-12-04-improve-translated-text-fitting/tasks.md
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/design.md
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/proposal.md
--- a/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md
+++ b/openspec/changes/archive/2025-12-04-pdf-preprocessing-pipeline/tasks.md
--- a/openspec/changes/refactor-dual-track-architecture/design.md
+++ b/openspec/changes/refactor-dual-track-architecture/design.md
@@ -0,0 +1,240 @@
 # Design: Refactor Dual-Track Architecture
 ## Context
 Tool_OCR 是一個雙軌制文件處理系統，支援：
 - **Direct Track**: 從可編輯 PDF 直接提取結構化內容
 - **OCR Track**: 使用 PaddleOCR + PP-StructureV3 進行光學字符識別
 目前系統存在以下技術債務：
 - OCRService (2,326 行) 承擔過多職責
 - PDFGeneratorService (4,644 行) 是單體服務
 - 記憶體管理分散在多個組件中
 - 已知 bug 影響輸出品質
 ## Goals / Non-Goals
 ### Goals
 - 修復 PLAN.md 中列出的所有已知 bug
 - 將 OCRService 拆分為 < 800 行的可維護單元
 - 將 PDFGeneratorService 拆分為 < 2,000 行
 - 簡化記憶體管理配置
 - 提升前端狀態管理一致性
 ### Non-Goals
 - 不改變現有 API 契約
 - 不引入新的外部依賴
 - 不改變資料庫 schema
 - 不改變使用者介面
 ## Decisions
 ### Decision 1: 使用 PyMuPDF find_tables() 取代自定義表格檢測
 **選擇**: 使用 PyMuPDF 內建的 `page.find_tables()` API
 **理由**:
 - PyMuPDF 的表格檢測能正確識別合併單元格
 - 返回的 `table.cells` 結構包含 span 資訊
 - 減少自定義代碼維護負擔
 **替代方案**:
 - 改進 `_detect_tables_by_position()` 算法
  - 優點：不依賴外部 API 變更
  - 缺點：複雜度高，難以處理所有邊界情況
 - 使用 Camelot 或 Tabula
  - 優點：成熟的表格提取庫
  - 缺點：引入新依賴，增加系統複雜度
 ### Decision 2: 使用 Strategy Pattern 重構服務層
 **選擇**: 引入 ProcessingOrchestrator 使用策略模式
 ```python
 class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        ...
 class DirectPipeline(ProcessingPipeline):
    def __init__(self, extraction_engine: DirectExtractionEngine):
        self.engine = extraction_engine
    def process(self, file_path, options):
        return self.engine.extract(file_path)
 class OCRPipeline(ProcessingPipeline):
    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
        self.ocr = ocr_service
        self.preprocessor = preprocessor
    def process(self, file_path, options):
        # Preprocessing + OCR + Conversion
        ...
 class ProcessingOrchestrator:
    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines
    def process(self, file_path, options):
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)
 ```
 **理由**:
 - 職責分離：檢測、處理、轉換各自獨立
 - 易於測試：可以單獨測試每個 Pipeline
 - 易於擴展：新增處理方式只需添加新 Pipeline
 **替代方案**:
 - 使用 Chain of Responsibility
  - 優點：更靈活的處理鏈
  - 缺點：對於二選一的場景過於複雜
 - 保持現狀，只做代碼整理
  - 優點：風險最低
  - 缺點：無法解決根本問題
 ### Decision 3: 分層提取 PDF 生成邏輯
 **選擇**: 將 PDFGeneratorService 拆分為三個模組
 ```
 PDFGeneratorService (主要編排)
 ├── PDFTableRenderer (表格渲染)
 │   ├── HTMLTableParser (HTML 表格解析)
 │   └── CellRenderer (單元格渲染)
 ├── PDFFontManager (字體管理)
 │   ├── FontLoader (字體載入)
 │   └── FontFallback (字體 fallback)
 └── PDFLayoutEngine (版面配置)
 ```
 **理由**:
 - 單一職責：每個模組專注一件事
 - 可重用：FontManager 可被其他服務使用
 - 易於測試：表格渲染可獨立測試
 ### Decision 4: 統一記憶體策略引擎
 **選擇**: 合併記憶體管理組件為單一 MemoryPolicyEngine
 ```python
 class MemoryPolicyEngine:
    """統一的記憶體策略引擎"""
    def __init__(self, config: MemoryConfig):
        self.config = config
        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)
    @property
    def gpu_usage_percent(self) -> float:
        # 統一的 GPU 使用率查詢
        ...
    def check_availability(self) -> MemoryStatus:
        # 返回 AVAILABLE, WARNING, CRITICAL, EMERGENCY
        ...
    async def acquire_prediction_slot(self):
        # 統一的並發控制
        ...
    def cleanup_if_needed(self):
        # 根據狀態自動清理
        ...
@dataclass
 class MemoryConfig:
    warning_threshold: float = 0.80      # 80%
    critical_threshold: float = 0.95     # 95%
    max_concurrent_predictions: int = 2
    model_idle_timeout: int = 300        # 5 minutes
 ```
 **理由**:
 - 減少配置項：從 8+ 降到 4 個核心配置
 - 簡化依賴：服務只需依賴一個記憶體引擎
 - 統一行為：所有記憶體決策在同一處做出
 ### Decision 5: 使用 Zustand 管理任務狀態
 **選擇**: 新增 TaskStore 統一管理任務狀態
 ```typescript
 interface TaskState {
  currentTaskId: string | null;
  tasks: Record<string, TaskDetail>;
  processingStatus: Record<string, ProcessingStatus>;
 }
 interface TaskActions {
  setCurrentTask: (taskId: string) => void;
  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
  clearTasks: () => void;
 }
 const useTaskStore = create<TaskState & TaskActions>()(
  persist(
    (set) => ({
      currentTaskId: null,
      tasks: {},
      processingStatus: {},
      // ... actions
    }),
    { name: 'task-storage' }
  )
 );
 ```
 **理由**:
 - 一致性：與現有 uploadStore、authStore 模式一致
 - 可追蹤：任務狀態變更集中管理
 - 持久化：刷新頁面後狀態保留
 ## Risks / Trade-offs
 | 風險 | 影響 | 緩解措施 |
 |------|------|----------|
 | PyMuPDF find_tables() API 變更 | 中 | 封裝為獨立函數，易於替換 |
 | 服務重構導致處理邏輯錯誤 | 高 | 保留原有測試，逐步重構 |
 | 記憶體引擎改變導致 OOM | 高 | 使用相同閾值，僅改變代碼結構 |
 | 前端狀態遷移導致 bug | 中 | 逐頁遷移，完整測試每個頁面 |
 ## Migration Plan
 ### Step 1: Bug Fixes (可獨立部署)
 1. 實現 PyMuPDF find_tables() 整合
 2. 修復 OCR Track 圖片路徑
 3. 添加 cell_boxes 座標驗證
 4. 測試並部署
 ### Step 2: Service Refactoring (可獨立部署)
 1. 提取 ProcessingOrchestrator
 2. 提取 TableRenderer 和 FontManager
 3. 更新 OCRService 使用新組件
 4. 測試並部署
 ### Step 3: Memory Management (可獨立部署)
 1. 實現 MemoryPolicyEngine
 2. 逐步遷移服務使用新引擎
 3. 移除舊組件
 4. 測試並部署
 ### Step 4: Frontend Improvements (可獨立部署)
 1. 新增 TaskStore
 2. 遷移 ProcessingPage
 3. 遷移 TaskDetailPage
 4. 合併類型定義
 5. 測試並部署
 ### Rollback Plan
 - 每個 Step 獨立部署，問題時可回滾到上一個穩定版本
 - Bug fixes 優先，確保基本功能正確
 - 重構不改變外部行為，回滾影響最小
 ## Open Questions
 1. **PyMuPDF find_tables() 的版本相容性**: 需確認目前使用的 PyMuPDF 版本是否支援此 API
 2. **前端狀態持久化範圍**: 是否所有任務都需要持久化，還是只保留當前會話？
 3. **記憶體閾值調整**: 現有閾值是否經過生產驗證，可以直接沿用？
--- a/openspec/changes/refactor-dual-track-architecture/proposal.md
+++ b/openspec/changes/refactor-dual-track-architecture/proposal.md
@@ -0,0 +1,68 @@
 # Change: Refactor Dual-Track Architecture
 ## Why
 目前雙軌制 OCR 系統存在多個已知問題和架構債務：
 1. **Direct Track 表格問題**: `_detect_tables_by_position()` 無法識別合併單元格，導致 edit3.pdf 產生 204 個錯誤拆分的 cells（應為 83 個）
 2. **OCR Track 圖片路徑丟失**: CHART/DIAGRAM 等視覺元素的 `saved_path` 在轉換時丟失，導致圖片未放回 PDF
 3. **OCR Track cell_boxes 座標錯亂**: PP-StructureV3 返回的 cell_boxes 超出頁面邊界
 4. **服務層過度複雜**: OCRService (2,326 行) 承擔過多職責，難以維護和測試
 5. **PDF 生成器過於龐大**: PDFGeneratorService (4,644 行) 是單體服務，難以擴展
 ## What Changes
 ### Phase 1: 修復已知 Bug（優先級：最高）
 - **Direct Track 表格修復**: 改用 PyMuPDF `find_tables()` API 取代 `_detect_tables_by_position()`
 - **OCR Track 圖片路徑修復**: 擴展 `_convert_pp3_element` 處理所有視覺元素類型 (IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP)
 - **Cell boxes 座標驗證**: 添加邊界檢查，超出範圍時使用 CV 線檢測 fallback
 - **過濾極小裝飾圖片**: 過濾 < 200 px² 的圖片
 - **移除覆蓋圖像**: 在渲染階段過濾與 covering_images 重疊的圖片
 ### Phase 2: 服務層重構（優先級：高）
 - **拆分 OCRService**: 提取獨立的 `ProcessingOrchestrator` 負責流程編排
 - **建立 Pipeline 模式**: 使用組合模式取代目前的聚合模式
 - **提取 TableRenderer**: 從 PDFGeneratorService 提取表格渲染邏輯
 - **提取 FontManager**: 從 PDFGeneratorService 提取字體管理邏輯
 ### Phase 3: 記憶體管理簡化（優先級：中）
 - **統一記憶體策略**: 合併 MemoryManager、MemoryGuard、各類 Semaphore 為單一策略引擎
 - **簡化配置**: 減少 8+ 個記憶體相關配置項到核心 3-4 項
 ### Phase 4: 前端狀態管理改進（優先級：中）
 - **新增 TaskStore**: 使用 Zustand 管理任務狀態，取代分散的 useState
 - **合併類型定義**: 統一 api.ts 和 apiV2.ts 為單一類型定義檔案
 ## Impact
 - Affected specs: `document-processing`
 - Affected code:
  - `backend/app/services/direct_extraction_engine.py` (表格檢測)
  - `backend/app/services/ocr_to_unified_converter.py` (元素轉換)
  - `backend/app/services/ocr_service.py` (服務編排)
  - `backend/app/services/pdf_generator_service.py` (PDF 生成)
  - `backend/app/services/memory_manager.py` (記憶體管理)
  - `frontend/src/store/` (狀態管理)
  - `frontend/src/types/` (類型定義)
 ## Risk Assessment
 | 風險 | 嚴重性 | 緩解措施 |
 |------|--------|----------|
 | 表格渲染回歸 | 高 | 使用 edit.pdf 和 edit3.pdf 作為回歸測試 |
 | 記憶體管理變更導致 OOM | 高 | 保留現有閾值，僅重構代碼結構 |
 | 服務重構導致處理失敗 | 中 | 逐步重構，每階段完整測試 |
 ## Success Metrics
 | 指標 | 目前 | 目標 |
 |------|------|------|
 | edit3.pdf Direct Track cells | 204 (錯誤) | 83 (正確) |
 | OCR Track 圖片放回率 | 0% | 100% |
 | cell_boxes 座標正確率 | ~40% | 100% |
 | OCRService 行數 | 2,326 | < 800 |
 | PDFGeneratorService 行數 | 4,644 | < 2,000 |
--- a/openspec/changes/refactor-dual-track-architecture/specs/document-processing/spec.md
+++ b/openspec/changes/refactor-dual-track-architecture/specs/document-processing/spec.md
@@ -0,0 +1,151 @@
 # document-processing Specification Delta
 ## ADDED Requirements
 ### Requirement: Table Cell Merging Detection
 The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
 #### Scenario: Detect merged cells in Direct Track
 - **WHEN** extracting tables from an editable PDF using Direct Track
 - **THEN** the system SHALL use PyMuPDF find_tables() API
 - **AND** correctly identify cells with rowspan > 1 or colspan > 1
 - **AND** preserve merge information in UnifiedDocument table structure
 - **AND** skip placeholder cells that are covered by merged cells
 #### Scenario: Handle complex table structures
 - **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
 - **THEN** the system SHALL NOT split merged cells into individual cells
 - **AND** the output cell count SHALL match the actual visual cell count
 - **AND** the rendered PDF SHALL display correct merged cell boundaries
 ### Requirement: Visual Element Path Preservation
 The system SHALL preserve image paths for all visual element types during OCR conversion.
 #### Scenario: Preserve CHART element paths
 - **WHEN** converting PP-StructureV3 output containing CHART elements
 - **THEN** the system SHALL treat CHART as a visual element type
 - **AND** extract saved_path from the element data
 - **AND** include saved_path in the UnifiedDocument content field
 #### Scenario: Support all visual element types
 - **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
 - **THEN** the system SHALL extract saved_path or img_path for each element
 - **AND** preserve path, width, height, and format in content dictionary
 - **AND** enable downstream PDF generation to embed these images
 #### Scenario: Fallback path resolution
 - **WHEN** a visual element has multiple path fields (saved_path, img_path)
 - **THEN** the system SHALL prefer saved_path over img_path
 - **AND** fallback to img_path if saved_path is missing
 - **AND** log warning if both paths are missing
 ### Requirement: Cell Box Coordinate Validation
 The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
 #### Scenario: Detect out-of-bounds coordinates
 - **WHEN** processing cell_boxes from PP-StructureV3
 - **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
 - **AND** log tables with coordinates exceeding page bounds
 - **AND** mark affected cells for fallback processing
 #### Scenario: Apply CV line detection fallback
 - **WHEN** cell_boxes coordinates are invalid (out of bounds)
 - **THEN** the system SHALL apply OpenCV line detection as fallback
 - **AND** reconstruct table structure from detected lines
 - **AND** include fallback_used flag in table metadata
 #### Scenario: Coordinate normalization
 - **WHEN** coordinates are within page bounds but slightly outside table bbox
 - **THEN** the system SHALL clamp coordinates to table boundaries
 - **AND** preserve relative cell positions
 - **AND** ensure no cells overlap after normalization
 ### Requirement: Decoration Image Filtering
 The system SHALL filter out minimal decoration images that do not contribute meaningful content.
 #### Scenario: Filter tiny images by area
 - **WHEN** extracting images from a document
 - **THEN** the system SHALL calculate image area (width x height)
 - **AND** filter out images with area < 200 square pixels
 - **AND** log filtered image count for debugging
 #### Scenario: Configurable filtering threshold
 - **WHEN** processing documents with intentionally small images
 - **THEN** the system SHALL support configuration of minimum image area threshold
 - **AND** default to 200 square pixels if not specified
 - **AND** allow threshold = 0 to disable filtering
 ### Requirement: Covering Image Removal
 The system SHALL remove covering/redaction images from the final output.
 #### Scenario: Detect covering rectangles
 - **WHEN** preprocessing a PDF page
 - **THEN** the system SHALL detect black/white rectangles covering text regions
 - **AND** identify covering images by high IoU (> 0.8) with underlying content
 - **AND** mark covering images for exclusion
 #### Scenario: Exclude covering images from rendering
 - **WHEN** generating output PDF
 - **THEN** the system SHALL exclude images marked as covering
 - **AND** preserve the text content that was covered
 - **AND** include covering_images_removed count in metadata
 #### Scenario: Handle both black and white covering
 - **WHEN** detecting covering rectangles
 - **THEN** the system SHALL detect both black fill (redaction style)
 - **AND** white fill (whiteout style)
 - **AND** low-contrast rectangles intended to hide content
 ## MODIFIED Requirements
 ### Requirement: Enhanced OCR with Full PP-StructureV3
 The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
 #### Scenario: Extract comprehensive document structure
 - **WHEN** processing through OCR track
 - **THEN** the system SHALL use page_result.json['parsing_res_list']
 - **AND** extract all element types including headers, lists, tables, figures
 - **AND** preserve layout_bbox coordinates for each element
 #### Scenario: Maintain reading order
 - **WHEN** extracting elements from PP-StructureV3
 - **THEN** the system SHALL preserve the reading order from parsing_res_list
 - **AND** assign sequential indices to elements
 - **AND** support reordering for complex layouts
 #### Scenario: Extract table structure
 - **WHEN** PP-StructureV3 identifies a table
 - **THEN** the system SHALL extract cell content and boundaries
 - **AND** validate cell_boxes coordinates against page boundaries
 - **AND** apply fallback detection for invalid coordinates
 - **AND** preserve table HTML for structure
 - **AND** extract plain text for translation
 #### Scenario: Extract visual elements with paths
 - **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
 - **THEN** the system SHALL preserve saved_path for each element
 - **AND** include image dimensions and format
 - **AND** enable image embedding in output PDF
 ### Requirement: Generate UnifiedDocument from direct extraction
 The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
 #### Scenario: Extract tables with cell merging
 - **WHEN** direct extraction encounters a table
 - **THEN** the system SHALL use PyMuPDF find_tables() API
 - **AND** extract cell content with correct rowspan/colspan
 - **AND** preserve merged cell boundaries
 - **AND** skip placeholder cells covered by merges
 #### Scenario: Filter decoration images
 - **WHEN** extracting images from PDF
 - **THEN** the system SHALL filter images smaller than minimum area threshold
 - **AND** exclude covering/redaction images
 - **AND** preserve meaningful content images
 #### Scenario: Preserve text styling with image handling
 - **WHEN** direct extraction completes
 - **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
 - **AND** preserve text styling, fonts, and exact positioning
 - **AND** extract tables with cell boundaries, content, and merge info
 - **AND** include only meaningful images in output
--- a/openspec/changes/refactor-dual-track-architecture/tasks.md
+++ b/openspec/changes/refactor-dual-track-architecture/tasks.md
@@ -0,0 +1,108 @@
 # Tasks: Refactor Dual-Track Architecture
 ## Phase 1: 修復已知 Bug (已完成)
 ### 1.1 Direct Track 表格修復 (已完成 ✓)
 - [x] 1.1.1 修改 `_process_native_table()` 方法使用 `table.cells` 處理合併單元格
 - [x] 1.1.2 使用 PyMuPDF `page.find_tables()` API (已在使用中)
 - [x] 1.1.3 解析 `table.cells` 並正確計算 `row_span`/`col_span`
 - [x] 1.1.4 處理被合併的單元格（跳過 `None` 值，建立 covered grid）
 - [x] 1.1.5 驗證 edit3.pdf 返回 83 個正確的 cells ✓
 ### 1.2 OCR Track 圖片路徑修復 (已完成 ✓)
 - [x] 1.2.1 修改 `ocr_to_unified_converter.py` 第 604-613 行
 - [x] 1.2.2 擴展視覺元素類型判斷：`IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP`
 - [x] 1.2.3 優先使用 `saved_path`，fallback 到 `img_path`
 - [x] 1.2.4 確保 content dict 包含 `saved_path`, `path`, `width`, `height`, `format`
 - [x] 1.2.5 程式碼已修正 (需 OCR Track 完整測試驗證)
 - [x] 1.2.6 程式碼已修正 (需 OCR Track 完整測試驗證)
 ### 1.3 Cell boxes 座標驗證 (已完成 ✓)
 - [x] 1.3.1 在 `ocr_to_unified_converter.py` 添加 `validate_cell_boxes()` 函數
 - [x] 1.3.2 檢查 cell_boxes 是否超出頁面邊界 (0, 0, page_width, page_height)
 - [x] 1.3.3 超出範圍時使用 clamped coordinates，標記 needs_fallback
 - [x] 1.3.4 添加日誌記錄異常座標
 - [x] 1.3.5 單元測試驗證座標驗證邏輯正確 ✓
 ### 1.4 過濾極小裝飾圖片 (已完成 ✓)
 - [x] 1.4.1 在 `direct_extraction_engine.py` 圖片提取邏輯添加面積檢查
 - [x] 1.4.2 過濾 `image_area < min_image_area` (默認 200 px²) 的圖片
 - [x] 1.4.3 添加 `min_image_area` 配置項允許調整閾值
 - [x] 1.4.4 驗證 edit3.pdf 偵測到 3 個極小裝飾圖片 ✓
 ### 1.5 移除覆蓋圖像 (已完成 ✓)
 - [x] 1.5.1 傳遞 `covering_images` 到 `_extract_images()` 方法
 - [x] 1.5.2 使用 IoU 閾值 (0.8) 和 xref 比對判斷覆蓋圖像
 - [x] 1.5.3 從最終輸出中排除覆蓋圖像
 - [x] 1.5.4 添加 `_calculate_iou()` 輔助方法
 - [x] 1.5.5 驗證 edit3.pdf 偵測到 6 個黑框覆蓋圖像 ✓
 ## Phase 2: 服務層重構
 ### 2.1 提取 ProcessingOrchestrator
 - [ ] 2.1.1 建立 `backend/app/services/processing_orchestrator.py`
 - [ ] 2.1.2 從 OCRService 提取流程編排邏輯
 - [ ] 2.1.3 定義 `ProcessingPipeline` 介面
 - [ ] 2.1.4 實現 DirectPipeline 和 OCRPipeline
 - [ ] 2.1.5 更新 OCRService 使用 ProcessingOrchestrator
 - [ ] 2.1.6 確保現有功能不受影響
 ### 2.2 提取 TableRenderer
 - [ ] 2.2.1 建立 `backend/app/services/pdf_table_renderer.py`
 - [ ] 2.2.2 從 PDFGeneratorService 提取 HTMLTableParser
 - [ ] 2.2.3 提取表格渲染邏輯到獨立類
 - [ ] 2.2.4 支援合併單元格渲染
 - [ ] 2.2.5 更新 PDFGeneratorService 使用 TableRenderer
 ### 2.3 提取 FontManager
 - [ ] 2.3.1 建立 `backend/app/services/pdf_font_manager.py`
 - [ ] 2.3.2 提取字體載入和快取邏輯
 - [ ] 2.3.3 提取 CJK 字體支援邏輯
 - [ ] 2.3.4 實現字體 fallback 機制
 - [ ] 2.3.5 更新 PDFGeneratorService 使用 FontManager
 ## Phase 3: 記憶體管理簡化
 ### 3.1 統一記憶體策略引擎
 - [ ] 3.1.1 建立 `backend/app/services/memory_policy_engine.py`
 - [ ] 3.1.2 定義統一的記憶體策略介面
 - [ ] 3.1.3 合併 MemoryManager 和 MemoryGuard 邏輯
 - [ ] 3.1.4 整合 Semaphore 管理
 - [ ] 3.1.5 簡化配置到 3-4 個核心項目
 ### 3.2 更新服務使用新記憶體引擎
 - [ ] 3.2.1 更新 OCRService 使用 MemoryPolicyEngine
 - [ ] 3.2.2 更新 ServicePool 使用 MemoryPolicyEngine
 - [ ] 3.2.3 移除舊的 MemoryGuard 引用
 - [ ] 3.2.4 驗證 GPU 記憶體監控正常運作
 ## Phase 4: 前端狀態管理改進
 ### 4.1 新增 TaskStore
 - [ ] 4.1.1 建立 `frontend/src/store/taskStore.ts`
 - [ ] 4.1.2 定義任務狀態結構（currentTask, tasks, processingStatus）
 - [ ] 4.1.3 實現 CRUD 操作和狀態轉換
 - [ ] 4.1.4 添加 localStorage 持久化
 - [ ] 4.1.5 更新 ProcessingPage 使用 TaskStore
 - [ ] 4.1.6 更新 TaskDetailPage 使用 TaskStore
 ### 4.2 合併類型定義
 - [ ] 4.2.1 審查 `api.ts` 和 `apiV2.ts` 的差異
 - [ ] 4.2.2 合併類型定義到 `apiV2.ts`
 - [ ] 4.2.3 移除 `api.ts` 中的重複定義
 - [ ] 4.2.4 更新所有 import 路徑
 - [ ] 4.2.5 驗證 TypeScript 編譯無錯誤
 ## Phase 5: 測試與驗證
 ### 5.1 回歸測試
 - [ ] 5.1.1 使用 edit.pdf 測試 Direct Track（確保無回歸）
 - [ ] 5.1.2 使用 edit3.pdf 測試 Direct Track 表格合併
 - [ ] 5.1.3 使用 edit.pdf 測試 OCR Track 圖片放回
 - [ ] 5.1.4 使用 edit3.pdf 測試 OCR Track 圖片放回
 - [ ] 5.1.5 驗證所有 cell_boxes 座標正確
 ### 5.2 效能測試
 - [ ] 5.2.1 測量重構後的處理時間
 - [ ] 5.2.2 驗證記憶體使用無明顯增加
 - [ ] 5.2.3 驗證 GPU 使用率正常