This commit is contained in:
egg
2025-12-04 18:00:37 +08:00
parent 9437387ef1
commit 8265be1741
22 changed files with 2672 additions and 196 deletions

186
PLAN.md Normal file

@@ -0,0 +1,186 @@
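# PDF Processing Dual-Track Improvement Plan (Revision v5)
## Problem Analysis
### 1. Direct Track Table Issues
| Metric | edit.pdf | edit3.pdf |
|------|----------|-----------|
| Original table structure | 6 rows x 2 cols | 12 rows x 17 cols |
| Cells identified by PyMuPDF | 12 (no merges) | **83** (121 merged) |
| Cells extracted by Direct Track | 12 | **204** (all treated as 1x1) |
| Row/column span detection | Not needed | **❌ Not detected at all** |
| Rendering result | ✓ Perfect | ❌ Columns split incorrectly, text overflows |

**Root cause**: `_detect_tables_by_position()` cannot detect merged cells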
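### 2. Direct Track Image Issues (edit3.pdf)
| Issue | Count | Notes |
|------|------|------|
| Tiny decorative images | 3 | < 200 px², should be filtered |
| Covering images (black boxes) | 6 | Detected but not removed from rendering |
| Large vector_graphics | 3 | Already filtered correctly |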
### 3. OCR Track Table Issues
| Table | cells | cell_boxes | cell_boxes coordinate check |
|------|-------|------------|-------------------|
| pp3_0_3 | 13 | 13 | 1/5 out of range |
| pp3_0_6 | 29 | 12 | All out of range |
| pp3_0_7 | 12 | 51 | All out of range |
| pp3_0_16 | 51 | 29 | All out of range |

**Root cause**: the PP-StructureV3 cell_boxes coordinate system is inconsistent
### 4. OCR Track Image Issues ❌ Severe
| File | Image element | PP-Structure raw data | Converted UnifiedDocument | Result |
|------|---------|---------------------|----------------------|------|
| edit.pdf | pp3_1_8 | saved_path="pp3_1_8.png" | content=string | Image not restored |
| edit3.pdf | pp3_1_2 | saved_path="pp3_1_2.png" | content=string | Image not restored |

**Root cause**: in the `_convert_pp3_element` method of `ocr_to_unified_converter.py`
```python
# Current code (lines 604-613)
elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
    content = {'path': elem_data.get('img_path', ''), ...}
else:
    content = elem_data.get('content', '')  # ← the CHART type ends up here!
```
**Problems**:
1. The `CHART` type is not treated as a visual element
2. `saved_path` is lost entirely
3. `content` becomes text instead of an image path
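
The fall-through described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the `ElementType` enum here is a hypothetical stand-in with only four members, and `convert_content_old` is an illustrative name that mimics the pre-fix branch.

```python
from enum import Enum
from typing import Any, Dict, Union


class ElementType(Enum):
    """Hypothetical stand-in for the project's ElementType enum."""
    IMAGE = "image"
    FIGURE = "figure"
    CHART = "chart"
    TEXT = "text"


def convert_content_old(
    element_type: ElementType, elem_data: Dict[str, Any]
) -> Union[Dict[str, Any], str]:
    """Mimic the pre-fix branch: only IMAGE and FIGURE get a path dict."""
    if element_type in [ElementType.IMAGE, ElementType.FIGURE]:
        return {"path": elem_data.get("img_path", "")}
    # CHART (and every other type) falls through to plain text content,
    # so saved_path is never read
    return elem_data.get("content", "")


chart = {"saved_path": "pp3_1_8.png", "content": "chart caption text"}
result = convert_content_old(ElementType.CHART, chart)
```

With this input, `result` is the caption string rather than a dict carrying `pp3_1_8.png`, which matches the "image not restored" symptom in the table.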
---
## Improvement Plan
### Phase 1: Use PyMuPDF find_tables in the Direct Track (priority: highest)
**Problem**: `_detect_tables_by_position` cannot detect merged cells
**Solution**: switch to the PyMuPDF `find_tables()` API
**File**: `backend/app/services/direct_extraction_engine.py`
```python
def _extract_tables_with_pymupdf(self, page, page_num, counter):
    tables = page.find_tables()
    for table in tables.tables:
        # Get cells, preserving merge information
        cells = []
        for row_idx in range(table.row_count):
            for col_idx in range(table.col_count):
                cell_data = table.cells[row_idx * table.col_count + col_idx]
                if cell_data is None:
                    continue  # Skip cells absorbed by a merge
                # Compute row_span/col_span...
```
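
The span calculation elided in the snippet above could, for instance, be derived from the table's grid lines. This is a sketch under assumptions, not the commit's implementation: `infer_span`, `col_edges`, and `row_edges` are hypothetical names, and the sorted grid-line coordinates are assumed to be known already.

```python
from typing import Sequence, Tuple


def infer_span(
    bbox: Tuple[float, float, float, float],
    col_edges: Sequence[float],
    row_edges: Sequence[float],
    tol: float = 1.0,
) -> Tuple[int, int]:
    """Return (row_span, col_span) for a cell bbox on a known grid.

    col_edges and row_edges are grid-line coordinates (hypothetical inputs).
    Every grid line that lies strictly inside the bbox adds one to the span.
    """
    x0, y0, x1, y1 = bbox
    inner_cols = sum(1 for edge in col_edges if x0 + tol < edge < x1 - tol)
    inner_rows = sum(1 for edge in row_edges if y0 + tol < edge < y1 - tol)
    return inner_rows + 1, inner_cols + 1


# A 2-row, 3-column grid: the first cell spans two columns
col_edges = [0, 50, 100, 150]
row_edges = [0, 20, 40]
header_span = infer_span((0, 0, 100, 20), col_edges, row_edges)
side_span = infer_span((100, 0, 150, 40), col_edges, row_edges)
```

The tolerance keeps a bbox edge that sits almost exactly on a grid line from being counted as an interior line.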
### Phase 2: Fix image path loss in the OCR Track (priority: highest)
**Problem**: the saved_path of CHART elements is lost during conversion
**File**: `backend/app/services/ocr_to_unified_converter.py`
**Location**: `_convert_pp3_element` method, around line 604
**Change**:
```python
# Before
# elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:

# After: include all visual element types
elif element_type in [
    ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
    ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
]:
    # Prefer saved_path
    image_path = (
        elem_data.get('saved_path') or
        elem_data.get('img_path') or
        ''
    )
    content = {
        'saved_path': image_path,  # Key point: keep saved_path
        'path': image_path,
        'width': elem_data.get('width', 0),
        'height': elem_data.get('height', 0),
        'format': elem_data.get('format', 'unknown')
    }
```
### Phase 3: Fix OCR Track cell_boxes coordinates (priority: high)
**Solution**: validate the coordinates; when they fall out of range, fall back to CV line detection
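
One way to decide when the fallback applies is to measure how many boxes leave the page. The sketch below is illustrative only: the function names and the returned labels are hypothetical, and the 50% threshold and 5-pixel tolerance are assumed defaults rather than requirements stated by this plan.

```python
from typing import Sequence


def out_of_bounds_ratio(
    cell_boxes: Sequence[Sequence[float]],
    page_width: float,
    page_height: float,
    tolerance: float = 5.0,
) -> float:
    """Fraction of [x0, y0, x1, y1] boxes lying outside the page (plus tolerance)."""
    if not cell_boxes:
        return 0.0
    bad = 0
    for x0, y0, x1, y1 in cell_boxes:
        if (x0 < -tolerance or y0 < -tolerance
                or x1 > page_width + tolerance
                or y1 > page_height + tolerance):
            bad += 1
    return bad / len(cell_boxes)


def choose_cell_box_source(
    cell_boxes: Sequence[Sequence[float]],
    page_width: float,
    page_height: float,
    max_invalid_ratio: float = 0.5,  # assumed threshold
) -> str:
    """Return a hypothetical label naming which cell-box source to trust."""
    ratio = out_of_bounds_ratio(cell_boxes, page_width, page_height)
    return "cv_line_detection" if ratio > max_invalid_ratio else "pp_structure"


boxes = [[0, 0, 10, 10], [0, 0, 700, 10], [0, 0, 10, 900]]
source = choose_cell_box_source(boxes, page_width=600, page_height=800)
```

Two of the three boxes exceed the 600 x 800 page here, so the sketch selects the CV fallback.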
### Phase 4: Filter tiny decorative images (priority: high)
```python
if elem_area < 200:
    continue  # Skip images smaller than 200 px²
```
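
For context, the area check above might sit in a loop like the following. This is a minimal sketch assuming each image is a dict with a `bbox` of `(x0, y0, x1, y1)`; `filter_tiny_images` and the dict layout are illustrative, not the engine's actual data model.

```python
from typing import Any, Dict, List


def filter_tiny_images(
    images: List[Dict[str, Any]], min_area: float = 200.0
) -> List[Dict[str, Any]]:
    """Keep only images whose bbox area is at least min_area (px²)."""
    kept = []
    for img in images:
        x0, y0, x1, y1 = img["bbox"]
        elem_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
        if elem_area < min_area:
            continue  # skip tiny decorative images
        kept.append(img)
    return kept


images = [
    {"id": "dot", "bbox": (0, 0, 10, 10)},       # 100 px², dropped
    {"id": "edge", "bbox": (0, 0, 20, 10)},      # exactly 200 px², kept
    {"id": "photo", "bbox": (0, 0, 100, 100)},   # 10000 px², kept
]
kept_ids = [img["id"] for img in filter_tiny_images(images)]
```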
### Phase 5: Filter covering images (priority: high)
During extraction, filter out images that overlap covering_images
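
The overlap test could be expressed as the share of an image's area hidden by a covering rectangle. The sketch below rests on assumptions: the helper names are hypothetical, bboxes are `(x0, y0, x1, y1)` tuples, and the 0.8 overlap threshold is an illustrative choice that the plan does not specify.

```python
from typing import Any, Dict, List, Sequence, Tuple

BBox = Tuple[float, float, float, float]


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of the overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def is_covered(image_bbox: BBox, covering: Sequence[BBox], min_overlap: float = 0.8) -> bool:
    """True if any covering box hides at least min_overlap of the image area."""
    area = (image_bbox[2] - image_bbox[0]) * (image_bbox[3] - image_bbox[1])
    if area <= 0:
        return False
    return any(intersection_area(image_bbox, c) / area >= min_overlap for c in covering)


def drop_covered_images(
    images: List[Dict[str, Any]], covering: Sequence[BBox], min_overlap: float = 0.8
) -> List[Dict[str, Any]]:
    """Return the images that are not hidden by a covering box."""
    return [img for img in images if not is_covered(img["bbox"], covering, min_overlap)]


covering_images = [(0, 0, 100, 100)]
candidates = [
    {"id": "a", "bbox": (10, 10, 50, 50)},     # fully inside the cover
    {"id": "b", "bbox": (90, 90, 190, 190)},   # only 1% overlapped
]
remaining = [img["id"] for img in drop_covered_images(candidates, covering_images)]
```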
---
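## Implementation Priority
| Phase | Description | Priority | Impact |
|------|------|--------|------|
| 1 | Direct Track uses PyMuPDF find_tables | **Highest** | Fixes merged cells |
| 2 | **OCR Track image path fix** | **Highest** | Fixes images not being restored |
| 3 | OCR Track cell_boxes coordinate fix | High | Fixes garbled table rendering |
| 4 | Filter tiny decorative images | High | Fewer meaningless images |
| 5 | Filter covering images | High | Fewer black boxes |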
---
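## Expected Results
### Direct Track
| Metric | Before | After |
|------|--------|--------|
| edit3.pdf cells | 204 (incorrectly split) | 83 (merges detected correctly) |
| Row/column span detection | Not detected | Detected |
### OCR Track Images
| Metric | Before | After |
|------|--------|--------|
| pp3_1_8 (edit.pdf) | Image not restored | Restored correctly |
| pp3_1_2 (edit3.pdf) | Image not restored | Restored correctly |
### OCR Track Tables
| Metric | Before | After |
|------|--------|--------|
| cell_boxes coordinates | 3/5 tables wrong | All correct, or CV fallback |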
---
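## Test Plan
1. **edit.pdf Direct Track**: ensure there are no regressions
2. **edit3.pdf Direct Track**:
   - Verify the table is detected with 83 cells (not 204)
   - Verify row/column spans are correct
   - Verify tiny images are filtered out
   - Verify black boxes are filtered out
3. **edit.pdf OCR Track**:
   - **Verify pp3_1_8.png is restored correctly**
   - Verify the cell_boxes coordinate fix
4. **edit3.pdf OCR Track**:
   - **Verify pp3_1_2.png is restored correctly**
   - Verify the cell_boxes coordinate fix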

File diff suppressed because it is too large

@@ -178,6 +178,114 @@ def trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]:
     return result
 
 
+def validate_cell_boxes(
+    cell_boxes: List[List[float]],
+    table_bbox: List[float],
+    page_width: float,
+    page_height: float,
+    tolerance: float = 5.0
+) -> Dict[str, Any]:
+    """
+    Validate cell_boxes coordinates against page boundaries and table bbox.
+
+    PP-StructureV3 sometimes returns cell_boxes with coordinates that exceed
+    page boundaries. This function validates and reports issues.
+
+    Args:
+        cell_boxes: List of cell bounding boxes [[x0, y0, x1, y1], ...]
+        table_bbox: Table bounding box [x0, y0, x1, y1]
+        page_width: Page width in pixels
+        page_height: Page height in pixels
+        tolerance: Allowed tolerance for boundary checks (pixels)
+
+    Returns:
+        Dict with:
+        - valid: bool - whether all cell_boxes are valid
+        - invalid_count: int - number of invalid cell_boxes
+        - clamped_boxes: List - cell_boxes clamped to valid boundaries
+        - issues: List[str] - description of issues found
+    """
+    if not cell_boxes:
+        return {'valid': True, 'invalid_count': 0, 'clamped_boxes': [], 'issues': []}
+
+    issues = []
+    invalid_count = 0
+    clamped_boxes = []
+
+    # Page boundaries with tolerance
+    min_x = -tolerance
+    min_y = -tolerance
+    max_x = page_width + tolerance
+    max_y = page_height + tolerance
+
+    for idx, box in enumerate(cell_boxes):
+        if not box or len(box) < 4:
+            issues.append(f"Cell {idx}: Invalid box format")
+            invalid_count += 1
+            clamped_boxes.append([0, 0, 0, 0])
+            continue
+
+        x0, y0, x1, y1 = box[:4]
+        is_valid = True
+        cell_issues = []
+
+        # Check if coordinates exceed page boundaries
+        if x0 < min_x:
+            cell_issues.append(f"x0={x0:.1f} < 0")
+            is_valid = False
+        if y0 < min_y:
+            cell_issues.append(f"y0={y0:.1f} < 0")
+            is_valid = False
+        if x1 > max_x:
+            cell_issues.append(f"x1={x1:.1f} > page_width={page_width:.1f}")
+            is_valid = False
+        if y1 > max_y:
+            cell_issues.append(f"y1={y1:.1f} > page_height={page_height:.1f}")
+            is_valid = False
+
+        # Check for inverted coordinates
+        if x0 > x1:
+            cell_issues.append(f"x0={x0:.1f} > x1={x1:.1f}")
+            is_valid = False
+        if y0 > y1:
+            cell_issues.append(f"y0={y0:.1f} > y1={y1:.1f}")
+            is_valid = False
+
+        if not is_valid:
+            invalid_count += 1
+            issues.append(f"Cell {idx}: {', '.join(cell_issues)}")
+
+        # Clamp to valid boundaries
+        clamped_box = [
+            max(0, min(x0, page_width)),
+            max(0, min(y0, page_height)),
+            max(0, min(x1, page_width)),
+            max(0, min(y1, page_height))
+        ]
+
+        # Ensure proper ordering after clamping
+        if clamped_box[0] > clamped_box[2]:
+            clamped_box[0], clamped_box[2] = clamped_box[2], clamped_box[0]
+        if clamped_box[1] > clamped_box[3]:
+            clamped_box[1], clamped_box[3] = clamped_box[3], clamped_box[1]
+
+        clamped_boxes.append(clamped_box)
+
+    if invalid_count > 0:
+        logger.warning(
+            f"Cell boxes validation: {invalid_count}/{len(cell_boxes)} invalid. "
+            f"Page: {page_width:.0f}x{page_height:.0f}, Table bbox: {table_bbox}"
+        )
+
+    return {
+        'valid': invalid_count == 0,
+        'invalid_count': invalid_count,
+        'clamped_boxes': clamped_boxes,
+        'issues': issues,
+        'needs_fallback': invalid_count > len(cell_boxes) * 0.5  # >50% invalid = needs fallback
+    }
+
+
 class OCRToUnifiedConverter:
     """
     Converter for transforming PP-StructureV3 OCR results to UnifiedDocument format.
@@ -337,19 +445,22 @@ class OCRToUnifiedConverter:
             for page_idx, page_result in enumerate(enhanced_results):
                 elements = []
 
+                # Get page dimensions first (needed for element conversion)
+                page_width = page_result.get('width', 0)
+                page_height = page_result.get('height', 0)
+                pp_dimensions = Dimensions(width=page_width, height=page_height)
+
                 # Process elements from parsing_res_list
                 if 'elements' in page_result:
                     for elem_data in page_result['elements']:
-                        element = self._convert_pp3_element(elem_data, page_idx)
+                        element = self._convert_pp3_element(
+                            elem_data, page_idx,
+                            page_width=page_width,
+                            page_height=page_height
+                        )
                         if element:
                             elements.append(element)
 
-                # Get page dimensions
-                pp_dimensions = Dimensions(
-                    width=page_result.get('width', 0),
-                    height=page_result.get('height', 0)
-                )
-
                 # Apply gap filling if enabled and raw regions available
                 if self.gap_filling_service and raw_text_regions:
                     # Filter raw regions for current page
@@ -556,9 +667,19 @@ class OCRToUnifiedConverter:
     def _convert_pp3_element(
         self,
         elem_data: Dict[str, Any],
-        page_idx: int
+        page_idx: int,
+        page_width: float = 0,
+        page_height: float = 0
     ) -> Optional[DocumentElement]:
-        """Convert PP-StructureV3 element to DocumentElement."""
+        """
+        Convert PP-StructureV3 element to DocumentElement.
+
+        Args:
+            elem_data: Element data from PP-StructureV3
+            page_idx: Page index (0-based)
+            page_width: Page width for coordinate validation
+            page_height: Page height for coordinate validation
+        """
         try:
             # Extract bbox
             bbox_data = elem_data.get('bbox', [0, 0, 0, 0])
@@ -597,18 +718,67 @@ class OCRToUnifiedConverter:
                 # Preserve cell_boxes and embedded_images in metadata for PDF generation
                 # These are extracted by PP-StructureV3 and provide accurate cell positioning
                 if 'cell_boxes' in elem_data:
-                    elem_data.setdefault('metadata', {})['cell_boxes'] = elem_data['cell_boxes']
-                    elem_data['metadata']['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
+                    cell_boxes = elem_data['cell_boxes']
+                    elem_data.setdefault('metadata', {})['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
+
+                    # Validate cell_boxes coordinates if page dimensions are available
+                    if page_width > 0 and page_height > 0:
+                        validation = validate_cell_boxes(
+                            cell_boxes=cell_boxes,
+                            table_bbox=bbox_data,
+                            page_width=page_width,
+                            page_height=page_height
+                        )
+
+                        if not validation['valid']:
+                            elem_data['metadata']['cell_boxes_validation'] = {
+                                'valid': False,
+                                'invalid_count': validation['invalid_count'],
+                                'total_count': len(cell_boxes),
+                                'needs_fallback': validation['needs_fallback']
+                            }
+                            # Use clamped boxes instead of invalid ones
+                            elem_data['metadata']['cell_boxes'] = validation['clamped_boxes']
+                            elem_data['metadata']['cell_boxes_original'] = cell_boxes
+
+                            if validation['needs_fallback']:
+                                logger.warning(
+                                    f"Table {elem_data.get('element_id')}: "
+                                    f"{validation['invalid_count']}/{len(cell_boxes)} cell_boxes invalid, "
+                                    f"fallback recommended"
+                                )
+                        else:
+                            elem_data['metadata']['cell_boxes'] = cell_boxes
+                            elem_data['metadata']['cell_boxes_validation'] = {'valid': True}
+                    else:
+                        # No page dimensions available, store as-is
+                        elem_data['metadata']['cell_boxes'] = cell_boxes
+
                 if 'embedded_images' in elem_data:
                     elem_data.setdefault('metadata', {})['embedded_images'] = elem_data['embedded_images']
-            elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
-                # For images, use metadata dict as content
+            elif element_type in [
+                ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
+                ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
+            ]:
+                # For all visual elements, use metadata dict as content
+                # Priority: saved_path > img_path (PP-StructureV3 uses saved_path)
+                image_path = (
+                    elem_data.get('saved_path') or
+                    elem_data.get('img_path') or
+                    ''
+                )
                 content = {
-                    'path': elem_data.get('img_path', ''),
+                    'saved_path': image_path,  # Preserve original path key
+                    'path': image_path,  # For backward compatibility
                     'width': elem_data.get('width', 0),
                     'height': elem_data.get('height', 0),
                     'format': elem_data.get('format', 'unknown')
                 }
+                if not image_path:
+                    logger.warning(
+                        f"Visual element {element_type.value} missing image path: "
+                        f"saved_path={elem_data.get('saved_path')}, img_path={elem_data.get('img_path')}"
+                    )
             else:
                 content = elem_data.get('content', '')
@@ -1139,10 +1309,18 @@ class OCRToUnifiedConverter:
             for page_idx, page_data in enumerate(pages_data):
                 elements = []
 
+                # Get page dimensions first
+                page_width = page_data.get('width', 0)
+                page_height = page_data.get('height', 0)
+
                 # Process each element in the page
                 if 'elements' in page_data:
                     for elem_data in page_data['elements']:
-                        element = self._convert_pp3_element(elem_data, page_idx)
+                        element = self._convert_pp3_element(
+                            elem_data, page_idx,
+                            page_width=page_width,
+                            page_height=page_height
+                        )
                         if element:
                             elements.append(element)
@@ -1150,8 +1328,8 @@ class OCRToUnifiedConverter:
                 page = Page(
                     page_number=page_idx + 1,
                     dimensions=Dimensions(
-                        width=page_data.get('width', 0),
-                        height=page_data.get('height', 0)
+                        width=page_width,
+                        height=page_height
                     ),
                     elements=elements,
                     metadata={'reading_order': self._calculate_reading_order(elements)}


@@ -3371,18 +3371,21 @@ class PDFGeneratorService:
             "rows": 6,
             "cols": 2,
             "cells": [
-                {"row": 0, "col": 0, "content": "..."},
+                {"row": 0, "col": 0, "content": "...", "row_span": 1, "col_span": 2},
                 {"row": 0, "col": 1, "content": "..."},
                 ...
             ]
         }
 
-        Returns format compatible with HTMLTableParser output:
+        Returns format compatible with HTMLTableParser output (with colspan/rowspan/col):
         [
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 0
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 1
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
             ...
         ]
+
+        Note: This returns actual cells per row with their absolute column positions.
+        The table renderer uses 'col' to place cells correctly in the grid.
         """
         try:
             num_rows = content.get('rows', 0)
@@ -3392,21 +3395,39 @@ class PDFGeneratorService:
             if not cells or num_rows == 0 or num_cols == 0:
                 return []
 
-            # Initialize rows structure
-            rows_data = []
-            for _ in range(num_rows):
-                rows_data.append({'cells': [{'text': ''} for _ in range(num_cols)]})
-
-            # Fill in cell content
-            for cell in cells:
-                row_idx = cell.get('row', 0)
-                col_idx = cell.get('col', 0)
-                cell_content = cell.get('content', '')
-
-                if 0 <= row_idx < num_rows and 0 <= col_idx < num_cols:
-                    rows_data[row_idx]['cells'][col_idx]['text'] = str(cell_content) if cell_content else ''
-
-            logger.debug(f"Built {num_rows} rows from cells dict")
+            # Group cells by row
+            cells_by_row = {}
+            for cell in cells:
+                row_idx = cell.get('row', 0)
+                if row_idx not in cells_by_row:
+                    cells_by_row[row_idx] = []
+                cells_by_row[row_idx].append(cell)
+
+            # Sort cells within each row by column
+            for row_idx in cells_by_row:
+                cells_by_row[row_idx].sort(key=lambda c: c.get('col', 0))
+
+            # Build rows structure with colspan/rowspan info and absolute col position
+            rows_data = []
+            for row_idx in range(num_rows):
+                row_cells = []
+                if row_idx in cells_by_row:
+                    for cell in cells_by_row[row_idx]:
+                        cell_content = cell.get('content', '')
+                        row_span = cell.get('row_span', 1) or 1
+                        col_span = cell.get('col_span', 1) or 1
+                        col_idx = cell.get('col', 0)
+
+                        row_cells.append({
+                            'text': str(cell_content) if cell_content else '',
+                            'rowspan': row_span,
+                            'colspan': col_span,
+                            'col': col_idx  # Absolute column position
+                        })
+
+                rows_data.append({'cells': row_cells})
+
+            logger.debug(f"Built {num_rows} rows from cells dict with span info")
             return rows_data
 
         except Exception as e:
@@ -3471,19 +3492,115 @@ class PDFGeneratorService:
             table_width = bbox.x1 - bbox.x0
             table_height = bbox.y1 - bbox.y0
 
-            # Build table data for ReportLab
-            table_content = []
-            for row in rows:
-                row_data = [cell['text'].strip() for cell in row['cells']]
-                table_content.append(row_data)
-
             # Create table
             from reportlab.platypus import Table, TableStyle
             from reportlab.lib import colors
 
-            # Determine number of rows and columns for cell_boxes calculation
+            # Determine grid size from rows structure
+            # Note: rows may have 'col' attribute for absolute positioning (from Direct extraction)
+            # or may be sequential (from HTML parsing)
             num_rows = len(rows)
-            max_cols = max(len(row['cells']) for row in rows) if rows else 0
+
+            # Check if cells have absolute column positions
+            has_absolute_cols = any(
+                'col' in cell
+                for row in rows
+                for cell in row['cells']
+            )
+
+            # Calculate actual number of columns
+            max_cols = 0
+            if has_absolute_cols:
+                # Use absolute col positions + colspan to find max column
+                for row in rows:
+                    for cell in row['cells']:
+                        col = cell.get('col', 0)
+                        colspan = cell.get('colspan', 1)
+                        max_cols = max(max_cols, col + colspan)
+            else:
+                # Sequential cells: sum up colspans
+                for row in rows:
+                    col_pos = 0
+                    for cell in row['cells']:
+                        colspan = cell.get('colspan', 1)
+                        col_pos += colspan
+                    max_cols = max(max_cols, col_pos)
+
+            # Build table data for ReportLab with proper grid structure
+            # ReportLab needs a full grid with placeholders for spanned cells
+            # and SPAN commands to merge them
+            table_content = []
+            span_commands = []
+            covered = set()  # Track cells covered by spans
+
+            # First pass: mark covered cells and collect SPAN commands
+            for row_idx, row in enumerate(rows):
+                if has_absolute_cols:
+                    # Use absolute column positions
+                    for cell in row['cells']:
+                        col_pos = cell.get('col', 0)
+                        colspan = cell.get('colspan', 1)
+                        rowspan = cell.get('rowspan', 1)
+
+                        # Mark cells covered by this span
+                        if colspan > 1 or rowspan > 1:
+                            for r in range(row_idx, row_idx + rowspan):
+                                for c in range(col_pos, col_pos + colspan):
+                                    if (r, c) != (row_idx, col_pos):
+                                        covered.add((r, c))
+                            # Add SPAN command for ReportLab
+                            span_commands.append((
+                                'SPAN',
+                                (col_pos, row_idx),
+                                (col_pos + colspan - 1, row_idx + rowspan - 1)
+                            ))
+                else:
+                    # Sequential positioning
+                    col_pos = 0
+                    for cell in row['cells']:
+                        while (row_idx, col_pos) in covered:
+                            col_pos += 1
+
+                        colspan = cell.get('colspan', 1)
+                        rowspan = cell.get('rowspan', 1)
+
+                        if colspan > 1 or rowspan > 1:
+                            for r in range(row_idx, row_idx + rowspan):
+                                for c in range(col_pos, col_pos + colspan):
+                                    if (r, c) != (row_idx, col_pos):
+                                        covered.add((r, c))
+                            span_commands.append((
+                                'SPAN',
+                                (col_pos, row_idx),
+                                (col_pos + colspan - 1, row_idx + rowspan - 1)
+                            ))
+                        col_pos += colspan
+
+            # Second pass: build content grid
+            for row_idx in range(num_rows):
+                row_data = [''] * max_cols
+
+                if row_idx < len(rows):
+                    if has_absolute_cols:
+                        # Place cells at their absolute positions
+                        for cell in rows[row_idx]['cells']:
+                            col_pos = cell.get('col', 0)
+                            if col_pos < max_cols:
+                                row_data[col_pos] = cell['text'].strip()
+                    else:
+                        # Sequential placement
+                        col_pos = 0
+                        for cell in rows[row_idx]['cells']:
+                            while col_pos < max_cols and (row_idx, col_pos) in covered:
+                                col_pos += 1
+                            if col_pos < max_cols:
+                                row_data[col_pos] = cell['text'].strip()
+                            colspan = cell.get('colspan', 1)
+                            col_pos += colspan
+
+                table_content.append(row_data)
+
+            logger.debug(f"Built table grid: {num_rows} rows × {max_cols} cols, {len(span_commands)} span commands (absolute_cols={has_absolute_cols})")
 
             # Use original column widths from extraction if available
             # Otherwise try to compute from cell_boxes (from PP-StructureV3)
@@ -3517,7 +3634,7 @@ class PDFGeneratorService:
             # Apply style with minimal padding to reduce table extension
             # Use Chinese font to support special characters (℃, μm, ≦, ×, Ω, etc.)
             font_for_table = self.font_name if self.font_registered else 'Helvetica'
-            style = TableStyle([
+            style_commands = [
                 ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
                 ('FONTNAME', (0, 0), (-1, -1), font_for_table),
                 ('FONTSIZE', (0, 0), (-1, -1), 8),
@@ -3529,7 +3646,13 @@ class PDFGeneratorService:
                 ('BOTTOMPADDING', (0, 0), (-1, -1), 0),
                 ('LEFTPADDING', (0, 0), (-1, -1), 1),
                 ('RIGHTPADDING', (0, 0), (-1, -1), 1),
-            ])
+            ]
+
+            # Add span commands for merged cells
+            style_commands.extend(span_commands)
+            if span_commands:
+                logger.info(f"Applied {len(span_commands)} SPAN commands for merged cells")
+            style = TableStyle(style_commands)
             t.setStyle(style)
 
             # Use canvas scaling as fallback to fit table within bbox
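
The SPAN collection for cells with absolute column positions can be looked at in isolation. The following is a simplified sketch rather than the service code: `build_span_commands` is a hypothetical helper, it handles only the absolute-column case, and it returns plain tuples in ReportLab's `('SPAN', (col, row), (col, row))` form without importing ReportLab.

```python
from typing import Any, Dict, List, Tuple

SpanCommand = Tuple[str, Tuple[int, int], Tuple[int, int]]


def build_span_commands(rows: List[Dict[str, Any]]) -> List[SpanCommand]:
    """Collect one SPAN command per merged cell, using absolute 'col' positions."""
    commands: List[SpanCommand] = []
    for row_idx, row in enumerate(rows):
        for cell in row["cells"]:
            col = cell.get("col", 0)
            colspan = cell.get("colspan", 1)
            rowspan = cell.get("rowspan", 1)
            if colspan > 1 or rowspan > 1:
                # ReportLab addresses cells as (column, row)
                commands.append((
                    "SPAN",
                    (col, row_idx),
                    (col + colspan - 1, row_idx + rowspan - 1),
                ))
    return commands


rows = [
    {"cells": [{"text": "Header", "col": 0, "colspan": 2, "rowspan": 1}]},
    {"cells": [{"text": "a", "col": 0}, {"text": "b", "col": 1}]},
]
spans = build_span_commands(rows)
```

The resulting list can be appended to the other style commands before constructing a `TableStyle`, as the hunk above does with `style_commands.extend(span_commands)`.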
@@ -4350,30 +4473,97 @@ class PDFGeneratorService:
                 # Replace newlines with <br/>
                 safe_content = safe_content.replace('\n', '<br/>')
 
-                # Calculate font size from bbox height, but keep minimum 10pt
-                font_size = max(box_height * 0.7, 10)
-                font_size = min(font_size, 24)  # Cap at 24pt
-
-                # Create style for this element
-                elem_style = ParagraphStyle(
-                    f'elem_{id(elem)}',
-                    parent=base_style,
-                    fontSize=font_size,
-                    leading=font_size * 1.2,
-                )
-
-                # Create paragraph
-                para = Paragraph(safe_content, elem_style)
-
-                # Calculate available width and height
-                available_width = box_width
-                available_height = box_height * 2  # Allow overflow
-
-                # Wrap the paragraph
-                para_width, para_height = para.wrap(available_width, available_height)
-
-                # Convert to PDF coordinates (y from bottom)
-                pdf_y = current_page_height - y0 - para_height
-
-                # Draw the paragraph
-                para.drawOn(pdf_canvas, x0, pdf_y)
+                # Get original font size from style info
+                style_info = elem.get('style', {})
+                original_font_size = style_info.get('font_size', 12.0)
+
+                # Detect vertical text (Y-axis labels, etc.)
+                # Vertical text has aspect_ratio (height/width) > 2 and multiple characters
+                is_vertical_text = (
+                    box_height > box_width * 2 and
+                    len(content.strip()) > 1
+                )
+
+                if is_vertical_text:
+                    # For vertical text, use original font size and rotate
+                    font_size = min(original_font_size, box_width * 0.9)
+                    font_size = max(font_size, 6)  # Minimum 6pt
+
+                    # Save canvas state for rotation
+                    pdf_canvas.saveState()
+
+                    # Convert to PDF coordinates
+                    pdf_y_center = current_page_height - (y0 + y1) / 2
+                    x_center = (x0 + x1) / 2
+
+                    # Translate to center, rotate, translate back
+                    pdf_canvas.translate(x_center, pdf_y_center)
+                    pdf_canvas.rotate(90)
+
+                    # Set font and draw text centered
+                    pdf_canvas.setFont(
+                        self.font_name if self.font_registered else 'Helvetica',
+                        font_size
+                    )
+                    # Draw text at origin (since we translated to center)
+                    text_width = pdf_canvas.stringWidth(
+                        safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'),
+                        self.font_name if self.font_registered else 'Helvetica',
+                        font_size
+                    )
+                    pdf_canvas.drawString(-text_width / 2, -font_size / 3,
+                        safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'))
+
+                    pdf_canvas.restoreState()
+                else:
+                    # For horizontal text, dynamically fit text within bbox
+                    # Start with original font size and reduce until text fits
+                    MIN_FONT_SIZE = 6
+                    MAX_FONT_SIZE = 14
+
+                    if original_font_size > 0:
+                        start_font_size = min(original_font_size, MAX_FONT_SIZE)
+                    else:
+                        start_font_size = min(box_height * 0.7, MAX_FONT_SIZE)
+
+                    font_size = max(start_font_size, MIN_FONT_SIZE)
+
+                    # Try progressively smaller font sizes until text fits
+                    para = None
+                    para_height = box_height + 1  # Start with height > box to enter loop
+
+                    while font_size >= MIN_FONT_SIZE and para_height > box_height:
+                        elem_style = ParagraphStyle(
+                            f'elem_{id(elem)}_{font_size}',
+                            parent=base_style,
+                            fontSize=font_size,
+                            leading=font_size * 1.15,  # Tighter leading
+                        )
+
+                        para = Paragraph(safe_content, elem_style)
+                        para_width, para_height = para.wrap(box_width, box_height * 3)
+
+                        if para_height <= box_height:
+                            break  # Text fits!
+
+                        font_size -= 0.5  # Reduce font size and try again
+
+                    # Ensure minimum font size
+                    if font_size < MIN_FONT_SIZE:
+                        font_size = MIN_FONT_SIZE
+                        elem_style = ParagraphStyle(
+                            f'elem_{id(elem)}_min',
+                            parent=base_style,
+                            fontSize=font_size,
+                            leading=font_size * 1.15,
+                        )
+                        para = Paragraph(safe_content, elem_style)
+                        para_width, para_height = para.wrap(box_width, box_height * 3)
+
+                    # Convert to PDF coordinates (y from bottom)
+                    # Clip to bbox height to prevent overflow
+                    actual_height = min(para_height, box_height)
+                    pdf_y = current_page_height - y0 - actual_height
+
+                    # Draw the paragraph
+                    para.drawOn(pdf_canvas, x0, pdf_y)
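
The shrink-until-it-fits loop can be separated from ReportLab for illustration. This is a sketch under assumptions: `fit_font_size` is a hypothetical helper, and `measure_height` stands in for wrapping a paragraph and reading back its height, which the real code does with `Paragraph.wrap`.

```python
from typing import Callable


def fit_font_size(
    measure_height: Callable[[float], float],
    box_height: float,
    start_size: float,
    min_size: float = 6.0,
    step: float = 0.5,
) -> float:
    """Reduce the font size in fixed steps until the text fits or min_size is hit."""
    size = max(start_size, min_size)
    while size > min_size and measure_height(size) > box_height:
        size = max(size - step, min_size)
    return size


def two_line_height(size: float) -> float:
    """Toy measurement: two lines of text with 1.15x leading."""
    return size * 1.15 * 2


fitted = fit_font_size(two_line_height, box_height=20.0, start_size=12.0)
floor = fit_font_size(lambda size: 1000.0, box_height=20.0, start_size=12.0)
```

Clamping inside the loop means the function never returns a value below `min_size`, so no separate post-loop correction is needed in this simplified form.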
@@ -4451,13 +4641,47 @@ class PDFGeneratorService:
             pdf_y_bottom = page_height - ty1
             pdf_canvas.rect(tx0, pdf_y_bottom, table_width, table_height, stroke=1, fill=0)
 
-            # Step 2: Draw cell borders using cell_boxes
+            # Step 2: Get or calculate cell boxes
             cell_boxes = metadata.get('cell_boxes', [])
-            if cell_boxes:
-                # Normalize cell boxes for grid alignment
-                if hasattr(self, '_normalize_cell_boxes_to_grid'):
-                    cell_boxes = self._normalize_cell_boxes_to_grid(cell_boxes)
-
+
+            # If no cell_boxes, calculate from column_widths and row_heights
+            if not cell_boxes:
+                column_widths = metadata.get('column_widths', [])
+                row_heights = metadata.get('row_heights', [])
+
+                if column_widths and row_heights:
+                    # Calculate cell positions from widths and heights
+                    cell_boxes = []
+                    rows = content.get('rows', len(row_heights)) if isinstance(content, dict) else len(row_heights)
+                    cols = content.get('cols', len(column_widths)) if isinstance(content, dict) else len(column_widths)
+
+                    # Calculate cumulative positions
+                    x_positions = [tx0]
+                    for w in column_widths[:cols]:
+                        x_positions.append(x_positions[-1] + w)
+
+                    y_positions = [ty0]
+                    for h in row_heights[:rows]:
+                        y_positions.append(y_positions[-1] + h)
+
+                    # Create cell boxes for each cell (row-major order)
+                    for row_idx in range(rows):
+                        for col_idx in range(cols):
+                            if col_idx < len(x_positions) - 1 and row_idx < len(y_positions) - 1:
+                                cx0 = x_positions[col_idx]
+                                cy0 = y_positions[row_idx]
+                                cx1 = x_positions[col_idx + 1]
+                                cy1 = y_positions[row_idx + 1]
+                                cell_boxes.append([cx0, cy0, cx1, cy1])
+
+                    logger.debug(f"Calculated {len(cell_boxes)} cell boxes from {cols} cols x {rows} rows")
+
+            # Normalize cell boxes for grid alignment
+            if cell_boxes and hasattr(self, '_normalize_cell_boxes_to_grid'):
+                cell_boxes = self._normalize_cell_boxes_to_grid(cell_boxes)
+
+            # Draw cell borders
+            if cell_boxes:
                 pdf_canvas.setLineWidth(0.5)
                 for box in cell_boxes:
                     if len(box) >= 4:
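
Deriving cell boxes from column widths and row heights is a cumulative sum in each direction. A compact sketch of the same idea follows; `grid_cell_boxes` is a hypothetical helper, and `itertools.accumulate` with `initial=` assumes Python 3.8 or newer.

```python
from itertools import accumulate
from typing import List, Sequence


def grid_cell_boxes(
    x0: float,
    y0: float,
    column_widths: Sequence[float],
    row_heights: Sequence[float],
) -> List[List[float]]:
    """Return [x0, y0, x1, y1] boxes in row-major order for a regular grid."""
    xs = list(accumulate(column_widths, initial=x0))  # column boundaries
    ys = list(accumulate(row_heights, initial=y0))    # row boundaries
    return [
        [xs[c], ys[r], xs[c + 1], ys[r + 1]]
        for r in range(len(row_heights))
        for c in range(len(column_widths))
    ]


boxes = grid_cell_boxes(10, 20, column_widths=[30, 40], row_heights=[5])
```

This form produces one box per grid position, so merged cells would still need the span handling applied elsewhere in the renderer.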


@@ -558,8 +558,8 @@ class PPStructureEnhanced:
                         element['embedded_images'] = embedded_images
                         logger.info(f"[TABLE] Embedded {len(embedded_images)} images into table")
 
-                # Special handling for images/figures/stamps (visual elements that need cropping)
-                elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.STAMP, ElementType.LOGO]:
+                # Special handling for images/figures/charts/stamps (visual elements that need cropping)
+                elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART, ElementType.DIAGRAM, ElementType.STAMP, ElementType.LOGO]:
                     # Save image if path provided
                     if 'img_path' in item and output_dir:
                         saved_path = self._save_image(item['img_path'], output_dir, element['element_id'])


@@ -0,0 +1,43 @@
"""Debug PyMuPDF table.cells structure"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

import fitz

pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
doc = fitz.open(str(pdf_path))
page = doc[0]

tables = page.find_tables()
for idx, table in enumerate(tables.tables):
    data = table.extract()
    num_rows = len(data)
    num_cols = max(len(row) for row in data) if data else 0

    print(f"Table {idx}:")
    print(f"  table.extract() dimensions: {num_rows} rows x {num_cols} cols")
    print(f"  Expected positions: {num_rows * num_cols}")

    cell_rects = getattr(table, 'cells', None)
    if cell_rects:
        print(f"  table.cells length: {len(cell_rects)}")
        none_count = sum(1 for c in cell_rects if c is None)
        actual_count = sum(1 for c in cell_rects if c is not None)
        print(f"  None cells: {none_count}")
        print(f"  Actual cells: {actual_count}")

        # Check if cell_rects matches grid size
        if len(cell_rects) != num_rows * num_cols:
            print(f"  WARNING: cell_rects length ({len(cell_rects)}) != grid size ({num_rows * num_cols})")

        # Show first few cells
        print(f"  First 5 cells: {cell_rects[:5]}")
    else:
        print(f"  table.cells: NOT AVAILABLE")

    # Check row_count and col_count
    print(f"  table.row_count: {getattr(table, 'row_count', 'N/A')}")
    print(f"  table.col_count: {getattr(table, 'col_count', 'N/A')}")

doc.close()


@@ -0,0 +1,48 @@
"""Debug PyMuPDF table structure - find merge info"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

import fitz

pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
doc = fitz.open(str(pdf_path))
page = doc[0]

tables = page.find_tables()
for idx, table in enumerate(tables.tables):
    print(f"\nTable {idx}:")

    # Check all available attributes
    print(f"  Available attributes: {[a for a in dir(table) if not a.startswith('_')]}")

    # Try to get header info
    if hasattr(table, 'header'):
        print(f"  header: {table.header}")

    # Check for cells info
    cell_rects = table.cells
    print(f"  cells count: {len(cell_rects)}")

    # Get the extracted data
    data = table.extract()
    print(f"  extract() shape: {len(data)} x {max(len(r) for r in data)}")

    # Check if there's a way to map cells to grid positions
    # Look at the pandas output which might have merge info
    try:
        df = table.to_pandas()
        print(f"  pandas shape: {df.shape}")
    except Exception as e:
        print(f"  pandas error: {e}")

    # Check the TableRow objects if available
    if hasattr(table, 'rows'):
        rows = table.rows
        print(f"  rows: {len(rows)}")
        for ri, row in enumerate(rows[:3]):  # first 3 rows
            print(f"    row {ri}: {len(row.cells)} cells")
            for ci, cell in enumerate(row.cells[:5]):  # first 5 cells
                print(f"      cell {ci}: bbox={cell}")

doc.close()


@@ -0,0 +1,111 @@
"""
Generate test PDF to verify Phase 1 fixes
"""
import sys
import os
from pathlib import Path
# Add backend to path
sys.path.insert(0, str(Path(__file__).parent.parent))
from app.services.direct_extraction_engine import DirectExtractionEngine
from app.services.pdf_generator_service import PDFGeneratorService
from app.services.unified_document_exporter import UnifiedDocumentExporter
def generate_test_pdf(input_pdf: str, output_dir: Path):
    """Generate test PDF using Direct Track extraction"""
    input_path = Path(input_pdf)
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Processing: {input_path.name}")
    print(f"Output dir: {output_dir}")
    # Step 1: Extract with Direct Track
    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,  # Filter tiny images
        enable_whiteout_detection=True,
        enable_content_sanitization=True
    )
    unified_doc = engine.extract(input_path, output_dir=output_dir)
    # Print extraction stats
    print(f"\n=== Extraction Results ===")
    print(f"Document ID: {unified_doc.document_id}")
    print(f"Pages: {len(unified_doc.pages)}")
    table_count = 0
    image_count = 0
    merged_cells = 0
    total_cells = 0
    for page in unified_doc.pages:
        for elem in page.elements:
            if elem.type.value == 'table':
                table_count += 1
                if elem.content and hasattr(elem.content, 'cells'):
                    total_cells += len(elem.content.cells)
                    for cell in elem.content.cells:
                        if cell.row_span > 1 or cell.col_span > 1:
                            merged_cells += 1
            elif elem.type.value == 'image':
                image_count += 1
    print(f"Tables: {table_count}")
    print(f" - Total cells: {total_cells}")
    print(f" - Merged cells: {merged_cells}")
    print(f"Images: {image_count}")
    # Step 2: Export to JSON
    exporter = UnifiedDocumentExporter()
    json_path = output_dir / f"{input_path.stem}_result.json"
    exporter.export_to_json(unified_doc, json_path)
    print(f"\nJSON saved: {json_path}")
    # Step 3: Generate layout PDF
    pdf_generator = PDFGeneratorService()
    pdf_path = output_dir / f"{input_path.stem}_layout.pdf"
    try:
        pdf_generator.generate_from_unified_document(
            unified_doc=unified_doc,
            output_path=pdf_path,
            source_file_path=input_path
        )
        print(f"PDF saved: {pdf_path}")
        return pdf_path
    except Exception as e:
        print(f"PDF generation error: {e}")
        import traceback
        traceback.print_exc()
        return None


if __name__ == "__main__":
    # Test with edit3.pdf (has complex tables with merging)
    demo_docs = Path(__file__).parent.parent.parent / "demo_docs"
    output_base = Path(__file__).parent.parent / "storage" / "test_phase1"
    # Process edit3.pdf
    edit3_pdf = demo_docs / "edit3.pdf"
    if edit3_pdf.exists():
        output_dir = output_base / "edit3"
        result = generate_test_pdf(str(edit3_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")
    # Also process edit.pdf for comparison
    edit_pdf = demo_docs / "edit.pdf"
    if edit_pdf.exists():
        output_dir = output_base / "edit"
        result = generate_test_pdf(str(edit_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")
    print(f"\n=== Output Location ===")
    print(f"{output_base}")


@@ -0,0 +1,285 @@
"""
Phase 1 Bug Fixes Verification Tests
Tests for:
1.1 Direct Track table cell merging
1.2 OCR Track image path preservation
1.3 Cell boxes coordinate validation
1.4 Tiny decoration image filtering
1.5 Covering image removal
"""
import sys
import os
from pathlib import Path
# Add backend to path
sys.path.insert(0, str(Path(__file__).parent.parent))
import fitz
from app.services.direct_extraction_engine import DirectExtractionEngine
from app.services.ocr_to_unified_converter import validate_cell_boxes
from app.models.unified_document import TableCell
def test_1_1_table_cell_merging():
    """Test 1.1.5: Verify edit3.pdf returns correct merged cells"""
    print("\n" + "="*60)
    print("TEST 1.1: Direct Track Table Cell Merging")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None  # None is reported as SKIP/INFO in the summary, like the other tests
    doc = fitz.open(str(pdf_path))
    total_cells = 0
    tables_with_merging = 0
    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table_idx, table in enumerate(tables.tables):
            data = table.extract()
            cell_rects = getattr(table, 'cells', None)
            if cell_rects:
                num_rows = len(data)
                num_cols = max(len(row) for row in data) if data else 0
                # Count actual cells (non-None)
                actual_cells = sum(1 for c in cell_rects if c is not None)
                none_cells = sum(1 for c in cell_rects if c is None)
                print(f" Page {page_num}, Table {table_idx}:")
                print(f" Grid size: {num_rows} x {num_cols} = {num_rows * num_cols} positions")
                print(f" Actual cells: {actual_cells}")
                print(f" Merged positions (None): {none_cells}")
                total_cells += actual_cells
                if none_cells > 0:
                    tables_with_merging += 1
    doc.close()
    print(f"\n Total actual cells across all tables: {total_cells}")
    print(f" Tables with merging: {tables_with_merging}")
    # According to PLAN.md, edit3.pdf should have 83 cells (not 204)
    # The presence of None values indicates merging is detected
    if total_cells > 0 and total_cells < 204:
        print(" RESULT: PASS - Cell merging detected correctly")
        return True
    elif total_cells == 204:
        print(" RESULT: FAIL - All cells treated as 1x1 (no merging detected)")
        return False
    else:
        print(f" RESULT: INCONCLUSIVE - {total_cells} cells found")
        return None
def test_1_3_cell_boxes_validation():
    """Test 1.3: Verify cell_boxes coordinate validation"""
    print("\n" + "="*60)
    print("TEST 1.3: Cell Boxes Coordinate Validation")
    print("="*60)
    # Test case 1: Valid coordinates
    valid_boxes = [
        [10, 10, 100, 50],
        [100, 10, 200, 50],
        [10, 50, 200, 100]
    ]
    result = validate_cell_boxes(valid_boxes, [0, 0, 300, 200], 300, 200)
    print(f" Valid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert result['valid'], "Valid boxes should pass validation"
    # Test case 2: Out of bounds coordinates
    invalid_boxes = [
        [-10, 10, 100, 50],  # x0 < 0
        [10, 10, 400, 50],   # x1 > page_width
        [10, 10, 100, 300]   # y1 > page_height
    ]
    result = validate_cell_boxes(invalid_boxes, [0, 0, 300, 200], 300, 200)
    print(f" Invalid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert not result['valid'], "Invalid boxes should fail validation"
    assert result['invalid_count'] == 3, "Should detect 3 invalid boxes"
    # Test case 3: Clamping
    assert len(result['clamped_boxes']) == 3, "Should return clamped boxes"
    clamped = result['clamped_boxes'][0]
    assert clamped[0] >= 0, "Clamped x0 should be >= 0"
    print(" RESULT: PASS - Coordinate validation works correctly")
    return True
def test_1_4_tiny_image_filtering():
    """Test 1.4: Verify tiny decoration image filtering"""
    print("\n" + "="*60)
    print("TEST 1.4: Tiny Decoration Image Filtering")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    doc = fitz.open(str(pdf_path))
    tiny_count = 0
    normal_count = 0
    min_area = 200  # Same threshold as in DirectExtractionEngine
    for page_num, page in enumerate(doc):
        images = page.get_images()
        for img in images:
            xref = img[0]
            rects = page.get_image_rects(xref)
            if rects:
                rect = rects[0]
                area = (rect.x1 - rect.x0) * (rect.y1 - rect.y0)
                if area < min_area:
                    tiny_count += 1
                    print(f" Page {page_num}: Tiny image xref={xref}, area={area:.1f} px²")
                else:
                    normal_count += 1
    doc.close()
    print(f"\n Tiny images (< {min_area} px²): {tiny_count}")
    print(f" Normal images: {normal_count}")
    if tiny_count > 0:
        print(" RESULT: PASS - Tiny images detected, will be filtered")
        return True
    else:
        print(" RESULT: INFO - No tiny images found in test file")
        return None
def test_1_5_covering_image_detection():
    """Test 1.5: Verify covering image detection"""
    print("\n" + "="*60)
    print("TEST 1.5: Covering Image Detection")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    engine = DirectExtractionEngine(
        enable_whiteout_detection=True,
        whiteout_iou_threshold=0.8
    )
    doc = fitz.open(str(pdf_path))
    total_covering = 0
    for page_num, page in enumerate(doc):
        result = engine._preprocess_page(page, page_num, doc)
        covering_images = result.get('covering_images', [])
        if covering_images:
            print(f" Page {page_num}: {len(covering_images)} covering images detected")
            for img in covering_images[:3]:  # Show first 3
                print(f" - xref={img.get('xref')}, type={img.get('color_type')}, "
                      f"bbox={[round(x, 1) for x in img.get('bbox', [])]}")
        total_covering += len(covering_images)
    doc.close()
    print(f"\n Total covering images detected: {total_covering}")
    if total_covering > 0:
        print(" RESULT: PASS - Covering images detected, will be filtered")
        return True
    else:
        print(" RESULT: INFO - No covering images found in test file")
        return None
def test_direct_extraction_full():
    """Full integration test for Direct Track extraction"""
    print("\n" + "="*60)
    print("INTEGRATION TEST: Direct Track Full Extraction")
    print("="*60)
    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None
    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,
        enable_whiteout_detection=True
    )
    try:
        result = engine.extract(pdf_path)  # Pass Path object, not string
        # Count elements
        table_count = 0
        image_count = 0
        merged_table_count = 0
        for page in result.pages:
            for elem in page.elements:
                if elem.type.value == 'table':
                    table_count += 1
                    if elem.content and hasattr(elem.content, 'cells'):
                        # Check for merged cells
                        for cell in elem.content.cells:
                            if cell.row_span > 1 or cell.col_span > 1:
                                merged_table_count += 1
                                break
                elif elem.type.value == 'image':
                    image_count += 1
        print(f" Document ID: {result.document_id}")
        print(f" Pages: {len(result.pages)}")
        print(f" Tables: {table_count} (with merging: {merged_table_count})")
        print(f" Images: {image_count}")
        print(" RESULT: PASS - Extraction completed successfully")
        return True
    except Exception as e:
        print(f" RESULT: FAIL - {e}")
        import traceback
        traceback.print_exc()
        return False
if __name__ == "__main__":
    print("="*60)
    print("Phase 1 Bug Fixes Verification Tests")
    print("="*60)
    results = {}
    # Run tests
    results['1.1_table_merging'] = test_1_1_table_cell_merging()
    results['1.3_coord_validation'] = test_1_3_cell_boxes_validation()
    results['1.4_tiny_filtering'] = test_1_4_tiny_image_filtering()
    results['1.5_covering_detection'] = test_1_5_covering_image_detection()
    results['integration'] = test_direct_extraction_full()
    # Summary
    print("\n" + "="*60)
    print("TEST SUMMARY")
    print("="*60)
    for test_name, result in results.items():
        status = "PASS" if result is True else "FAIL" if result is False else "SKIP/INFO"
        print(f" {test_name}: {status}")
    passed = sum(1 for r in results.values() if r is True)
    failed = sum(1 for r in results.values() if r is False)
    skipped = sum(1 for r in results.values() if r is None)
    print(f"\n Total: {passed} passed, {failed} failed, {skipped} skipped/info")


@@ -0,0 +1,148 @@
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card'
import { Badge } from '@/components/ui/badge'
import { Cpu, FileText, Sparkles, Info } from 'lucide-react'
import type { ProcessingTrack, DocumentAnalysisResponse } from '@/types/apiV2'
interface ProcessingTrackSelectorProps {
value: ProcessingTrack | null // null means "use system recommendation"
onChange: (track: ProcessingTrack | null) => void
documentAnalysis?: DocumentAnalysisResponse | null
disabled?: boolean
}
export default function ProcessingTrackSelector({
value,
onChange,
documentAnalysis,
disabled = false,
}: ProcessingTrackSelectorProps) {
const recommendedTrack = documentAnalysis?.recommended_track
const tracks = [
{
id: null as ProcessingTrack | null,
name: '自動選擇',
description: '根據文件類型自動選擇最佳處理方式',
icon: Sparkles,
color: 'text-purple-600',
bgColor: 'bg-purple-50',
borderColor: 'border-purple-200',
recommended: false,
},
{
id: 'direct' as ProcessingTrack,
name: '直接提取 (DIRECT)',
description: '從 PDF 中直接提取文字圖層,適用於可編輯 PDF',
icon: FileText,
color: 'text-blue-600',
bgColor: 'bg-blue-50',
borderColor: 'border-blue-200',
recommended: recommendedTrack === 'direct',
},
{
id: 'ocr' as ProcessingTrack,
name: 'OCR 識別',
description: '使用光學字元識別處理圖片或掃描文件',
icon: Cpu,
color: 'text-green-600',
bgColor: 'bg-green-50',
borderColor: 'border-green-200',
recommended: recommendedTrack === 'ocr',
},
]
return (
<Card>
<CardHeader>
<div className="flex items-center gap-3">
<div className="p-2 bg-primary/10 rounded-lg">
<Sparkles className="w-5 h-5 text-primary" />
</div>
<div>
<CardTitle></CardTitle>
<p className="text-sm text-muted-foreground mt-1">
</p>
</div>
</div>
</CardHeader>
<CardContent className="space-y-3">
{/* Info about override */}
{value !== null && recommendedTrack && value !== recommendedTrack && (
<div className="flex items-start gap-2 p-3 bg-amber-50 border border-amber-200 rounded-lg">
<Info className="w-4 h-4 text-amber-600 flex-shrink-0 mt-0.5" />
<p className="text-sm text-amber-800">
使{recommendedTrack === 'direct' ? '直接提取' : 'OCR 識別'}
</p>
</div>
)}
{/* Track options */}
<div className="grid gap-3">
{tracks.map((track) => {
const isSelected = value === track.id
const Icon = track.icon
return (
<button
key={track.id ?? 'auto'}
type="button"
disabled={disabled}
onClick={() => onChange(track.id)}
className={`
w-full p-4 rounded-lg border-2 text-left transition-all
${isSelected
? `${track.borderColor} ${track.bgColor}`
: 'border-border hover:border-primary/30 hover:bg-muted/30'
}
${disabled ? 'opacity-50 cursor-not-allowed' : 'cursor-pointer'}
`}
>
<div className="flex items-start gap-3">
<div className={`p-2 rounded-lg ${isSelected ? track.bgColor : 'bg-muted'}`}>
<Icon className={`w-5 h-5 ${isSelected ? track.color : 'text-muted-foreground'}`} />
</div>
<div className="flex-1 min-w-0">
<div className="flex items-center gap-2">
<span className={`font-medium ${isSelected ? track.color : ''}`}>
{track.name}
</span>
{track.recommended && (
<Badge variant="outline" className="text-xs bg-white">
</Badge>
)}
{isSelected && (
<Badge variant="default" className="text-xs">
</Badge>
)}
</div>
<p className="text-sm text-muted-foreground mt-1">
{track.description}
</p>
</div>
</div>
</button>
)
})}
</div>
{/* Current analysis info */}
{documentAnalysis && (
<div className="pt-3 border-t border-border">
<div className="flex flex-wrap gap-x-4 gap-y-1 text-xs text-muted-foreground">
<span>: {(documentAnalysis.confidence * 100).toFixed(0)}%</span>
{documentAnalysis.page_count && (
<span>: {documentAnalysis.page_count}</span>
)}
{documentAnalysis.text_coverage !== null && (
<span>: {(documentAnalysis.text_coverage * 100).toFixed(1)}%</span>
)}
</div>
</div>
)}
</CardContent>
</Card>
)
}


@@ -8,14 +8,15 @@ import { Button } from '@/components/ui/button'
 import { Badge } from '@/components/ui/badge'
 import { useToast } from '@/components/ui/toast'
 import { apiClientV2 } from '@/services/apiV2'
-import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2, Info } from 'lucide-react'
+import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2 } from 'lucide-react'
 import LayoutModelSelector from '@/components/LayoutModelSelector'
 import PreprocessingSettings from '@/components/PreprocessingSettings'
 import PreprocessingPreview from '@/components/PreprocessingPreview'
 import TableDetectionSelector from '@/components/TableDetectionSelector'
+import ProcessingTrackSelector from '@/components/ProcessingTrackSelector'
 import TaskNotFound from '@/components/TaskNotFound'
 import { useTaskValidation } from '@/hooks/useTaskValidation'
-import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, DocumentAnalysisResponse } from '@/types/apiV2'
+import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, ProcessingTrack } from '@/types/apiV2'
 export default function ProcessingPage() {
 const { t } = useTranslation()
@@ -56,6 +57,9 @@ export default function ProcessingPage() {
 enable_region_detection: true,
 })
+// Processing track override state (null = use system recommendation)
+const [forceTrack, setForceTrack] = useState<ProcessingTrack | null>(null)
 // Analyze document to determine if OCR is needed (only for pending tasks)
 const { data: documentAnalysis, isLoading: isAnalyzing } = useQuery({
 queryKey: ['documentAnalysis', taskId],
@@ -65,16 +69,23 @@ export default function ProcessingPage() {
 })
 // Determine if preprocessing options should be shown
-// Only show for OCR track files (images and non-editable PDFs)
-const needsOcrTrack = documentAnalysis?.recommended_track === 'ocr' ||
+// Show OCR options when:
+// 1. User explicitly selected OCR track
+// 2. OR system recommends OCR/hybrid track (and user hasn't overridden to direct)
+// 3. OR still analyzing (show by default)
+const needsOcrTrack = forceTrack === 'ocr' ||
+(forceTrack === null && (
+documentAnalysis?.recommended_track === 'ocr' ||
 documentAnalysis?.recommended_track === 'hybrid' ||
-!documentAnalysis // Show by default while analyzing
+!documentAnalysis
+))
 // Start OCR processing
 const processOCRMutation = useMutation({
 mutationFn: () => {
 const options: ProcessingOptions = {
-use_dual_track: true,
+use_dual_track: forceTrack === null, // Only use dual-track auto-detection if not forcing
+force_track: forceTrack || undefined, // Pass force_track if user selected one
 language: 'ch',
 layout_model: layoutModel,
 preprocessing_mode: preprocessingMode,
@@ -392,53 +403,14 @@ export default function ProcessingPage() {
 </div>
 )}
-{/* Document Analysis Info */}
-{documentAnalysis && (
-<Card className={documentAnalysis.recommended_track === 'direct' ? 'border-blue-200 bg-blue-50' : 'border-green-200 bg-green-50'}>
-<CardContent className="pt-4">
-<div className="flex items-start gap-3">
-<Info className={`w-5 h-5 flex-shrink-0 mt-0.5 ${documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}`} />
-<div className="flex-1">
-{documentAnalysis.recommended_track === 'direct' ? (
-<>
-<p className="text-sm font-medium text-blue-800"> PDF</p>
-<p className="text-sm text-blue-700 mt-1">
-PDF 使
-</p>
-</>
-) : (
-<>
-<p className="text-sm font-medium text-green-800">
-{documentAnalysis.is_editable ? '混合文件' : '掃描文件 / 影像'}
-</p>
-<p className="text-sm text-green-700 mt-1">
-{documentAnalysis.reason}
-</p>
-</>
-)}
-<div className="flex flex-wrap gap-4 mt-2 text-xs">
-<span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-: {documentAnalysis.recommended_track === 'direct' ? '直接提取' : documentAnalysis.recommended_track === 'ocr' ? 'OCR 識別' : '混合處理'}
-</span>
-{documentAnalysis.page_count && (
-<span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-: {documentAnalysis.page_count}
-</span>
-)}
-{documentAnalysis.text_coverage !== null && (
-<span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-: {(documentAnalysis.text_coverage * 100).toFixed(1)}%
-</span>
-)}
-<span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-: {(documentAnalysis.confidence * 100).toFixed(0)}%
-</span>
-</div>
-</div>
-</div>
-</CardContent>
-</Card>
+{/* Processing Track Selector - Always show after analysis */}
+{!isAnalyzing && (
+<ProcessingTrackSelector
+value={forceTrack}
+onChange={setForceTrack}
+documentAnalysis={documentAnalysis}
+disabled={processOCRMutation.isPending}
+/>
 )}
 {/* OCR Track Options - Only show when document needs OCR */}


@@ -0,0 +1,240 @@
# Design: Refactor Dual-Track Architecture
## Context
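Tool_OCR is a dual-track document processing system that supports:
- **Direct Track**: extracts structured content directly from editable PDFs
- **OCR Track**: performs optical character recognition with PaddleOCR + PP-StructureV3
The system currently carries the following technical debt:
- OCRService (2,326 lines) takes on too many responsibilities
- PDFGeneratorService (4,644 lines) is a monolithic service
- Memory management is scattered across multiple components
- Known bugs degrade output quality
## Goals / Non-Goals
### Goals
- Fix all known bugs listed in PLAN.md
- Split OCRService into maintainable units of < 800 lines
- Split PDFGeneratorService down to < 2,000 lines
- Simplify memory management configuration
- Improve the consistency of frontend state management
### Non-Goals
- Do not change existing API contracts
- Do not introduce new external dependencies
- Do not change the database schema
- Do not change the user interface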
## Decisions
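### Decision 1: Use PyMuPDF find_tables() in place of custom table detection
**Choice**: Use PyMuPDF's built-in `page.find_tables()` API
**Rationale**:
- PyMuPDF's table detection correctly identifies merged cells
- The returned `table.cells` structure contains span information
- Reduces the burden of maintaining custom code
**Alternatives**:
- Improve the `_detect_tables_by_position()` algorithm
  - Pros: no dependence on external API changes
  - Cons: high complexity; hard to handle every edge case
- Use Camelot or Tabula
  - Pros: mature table extraction libraries
  - Cons: introduces new dependencies and increases system complexity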
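As a rough illustration of how span information could be recovered from cell rectangles, the sketch below derives `row_span` and `col_span` from a list of `(x0, y0, x1, y1)` boxes by snapping their corners to shared grid edges. It is a minimal sketch under stated assumptions, not the engine's actual logic: it does not call PyMuPDF, the helper names (`grid_edges`, `nearest_edge`, `spans_from_bboxes`) and the 1.0 tolerance are hypothetical, and a grid line that is entirely hidden by merged cells cannot be recovered this way.

```python
def grid_edges(values, tol=1.0):
    """Collapse nearly equal coordinates into a sorted list of grid edges."""
    edges = []
    for v in sorted(values):
        if not edges or v - edges[-1] > tol:
            edges.append(v)
    return edges


def nearest_edge(edges, value):
    """Index of the grid edge closest to value."""
    return min(range(len(edges)), key=lambda i: abs(edges[i] - value))


def spans_from_bboxes(bboxes, tol=1.0):
    """Derive row, col, row_span and col_span for each (x0, y0, x1, y1) box."""
    xs = grid_edges([x for b in bboxes for x in (b[0], b[2])], tol)
    ys = grid_edges([y for b in bboxes for y in (b[1], b[3])], tol)
    cells = []
    for x0, y0, x1, y1 in bboxes:
        row, col = nearest_edge(ys, y0), nearest_edge(xs, x0)
        cells.append({
            "row": row,
            "col": col,
            # A span is the number of grid intervals the box crosses.
            "row_span": nearest_edge(ys, y1) - row,
            "col_span": nearest_edge(xs, x1) - col,
        })
    return cells


# Hypothetical 2x2 grid whose top row is a single merged cell.
example = [(0, 0, 200, 50), (0, 50, 100, 100), (100, 50, 200, 100)]
result = spans_from_bboxes(example)
```

With this approach the merged top cell is reported once with `col_span` 2 instead of being split into two 1x1 cells.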
### Decision 2: Refactor the service layer with the Strategy pattern
**Choice**: Introduce a ProcessingOrchestrator that uses the Strategy pattern
```python
from typing import Protocol


class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        ...


class DirectPipeline(ProcessingPipeline):
    def __init__(self, extraction_engine: DirectExtractionEngine):
        self.engine = extraction_engine

    def process(self, file_path, options):
        return self.engine.extract(file_path)


class OCRPipeline(ProcessingPipeline):
    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
        self.ocr = ocr_service
        self.preprocessor = preprocessor

    def process(self, file_path, options):
        # Preprocessing + OCR + Conversion
        ...


class ProcessingOrchestrator:
    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path, options):
        # A forced track wins; otherwise fall back to the detector's recommendation
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)
```
**Rationale**:
- Separation of responsibilities: detection, processing, and conversion are each independent
- Easy to test: each Pipeline can be tested in isolation
- Easy to extend: a new processing method only requires adding a new Pipeline
**Alternatives**:
- Use Chain of Responsibility
  - Pros: a more flexible processing chain
  - Cons: overly complex for an either/or scenario
- Keep the status quo and only tidy up the code
  - Pros: lowest risk
  - Cons: does not solve the underlying problems
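To make the testability argument concrete, here is a small runnable sketch of the same dispatch rule with stand-in objects. Everything in it is hypothetical scaffolding for illustration (`Options`, `Detection`, `StubPipeline`, `StubDetector`, `Orchestrator`); only the `force_track or detector.detect(...).track` rule is taken from the design above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Options:
    # Hypothetical stand-in for ProcessingOptions
    force_track: Optional[str] = None


@dataclass
class Detection:
    track: str


class StubPipeline:
    """Stand-in pipeline that only records which track handled the file."""

    def __init__(self, name: str):
        self.name = name

    def process(self, file_path: str, options: Options) -> str:
        return f"{self.name}:{file_path}"


class StubDetector:
    """Stand-in detector: treats PDFs as editable, everything else as scans."""

    def detect(self, file_path: str) -> Detection:
        return Detection(track="direct" if file_path.endswith(".pdf") else "ocr")


class Orchestrator:
    def __init__(self, detector, pipelines):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path, options):
        # A forced track overrides the detector's recommendation.
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)


orchestrator = Orchestrator(
    StubDetector(),
    {"direct": StubPipeline("direct"), "ocr": StubPipeline("ocr")},
)
auto_result = orchestrator.process("a.pdf", Options())
forced_result = orchestrator.process("a.pdf", Options(force_track="ocr"))
```

Because the orchestrator only depends on the `detect` and `process` methods, each real pipeline could be swapped for a stub like this in unit tests.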
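### Decision 3: Extract the PDF generation logic in layers
**Choice**: Split PDFGeneratorService into three modules
```
PDFGeneratorService (main orchestration)
├── PDFTableRenderer (table rendering)
│   ├── HTMLTableParser (HTML table parsing)
│   └── CellRenderer (cell rendering)
├── PDFFontManager (font management)
│   ├── FontLoader (font loading)
│   └── FontFallback (font fallback)
└── PDFLayoutEngine (page layout)
```
**Rationale**:
- Single responsibility: each module focuses on one thing
- Reusable: FontManager can be used by other services
- Easy to test: table rendering can be tested independently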
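### Decision 4: A unified memory policy engine
**Choice**: Merge the memory management components into a single MemoryPolicyEngine
```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto


class MemoryStatus(Enum):
    # Members follow the levels named in check_availability() below
    AVAILABLE = auto()
    WARNING = auto()
    CRITICAL = auto()
    EMERGENCY = auto()


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80  # 80%
    critical_threshold: float = 0.95  # 95%
    max_concurrent_predictions: int = 2
    model_idle_timeout: int = 300  # 5 minutes


class MemoryPolicyEngine:
    """Unified memory policy engine"""

    def __init__(self, config: MemoryConfig):
        self.config = config
        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)

    @property
    def gpu_usage_percent(self) -> float:
        # Unified GPU usage query
        ...

    def check_availability(self) -> MemoryStatus:
        # Returns AVAILABLE, WARNING, CRITICAL, or EMERGENCY
        ...

    async def acquire_prediction_slot(self):
        # Unified concurrency control
        ...

    def cleanup_if_needed(self):
        # Clean up automatically based on the current status
        ...
```
**Rationale**:
- Fewer configuration options: from 8+ down to 4 core settings
- Simpler dependencies: services depend on a single memory engine
- Uniform behavior: all memory decisions are made in one place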
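One way the two thresholds could map a usage reading onto a status level is sketched below. This is an assumption-laden illustration: `classify_usage` is a hypothetical helper, the inclusive `>=` comparisons are a choice made here, and the EMERGENCY level is left out because the design does not state a threshold for it.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryStatus(Enum):
    AVAILABLE = "available"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80
    critical_threshold: float = 0.95


def classify_usage(usage: float, config: MemoryConfig) -> MemoryStatus:
    """Map a usage fraction (0.0 to 1.0) onto a status level."""
    # Check the most severe level first so it takes precedence.
    if usage >= config.critical_threshold:
        return MemoryStatus.CRITICAL
    if usage >= config.warning_threshold:
        return MemoryStatus.WARNING
    return MemoryStatus.AVAILABLE


status_idle = classify_usage(0.50, MemoryConfig())
status_busy = classify_usage(0.80, MemoryConfig())
status_full = classify_usage(0.97, MemoryConfig())
```

Keeping the classification in a pure function like this would let the thresholds be tested without a GPU present.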
### Decision 5: Manage task state with Zustand
**Choice**: Add a TaskStore that manages task state in one place
```typescript
import { create } from 'zustand';
import { persist } from 'zustand/middleware';

interface TaskState {
  currentTaskId: string | null;
  tasks: Record<string, TaskDetail>;
  processingStatus: Record<string, ProcessingStatus>;
}

interface TaskActions {
  setCurrentTask: (taskId: string) => void;
  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
  clearTasks: () => void;
}

const useTaskStore = create<TaskState & TaskActions>()(
  persist(
    (set) => ({
      currentTaskId: null,
      tasks: {},
      processingStatus: {},
      // ... actions
    }),
    { name: 'task-storage' }
  )
);
```
**Rationale**:
- Consistency: matches the existing uploadStore and authStore pattern
- Traceability: task state changes are managed centrally
- Persistence: state is retained after a page refresh
## Risks / Trade-offs
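| Risk | Impact | Mitigation |
|------|--------|------------|
| PyMuPDF find_tables() API changes | | Wrap it in a standalone function so it is easy to replace |
| Service refactoring introduces processing logic errors | | Keep the existing tests and refactor incrementally |
| Memory engine change leads to OOM | | Use the same thresholds and change only the code structure |
| Frontend state migration introduces bugs | | Migrate page by page and fully test each page |
## Migration Plan
### Step 1: Bug Fixes (independently deployable)
1. Implement the PyMuPDF find_tables() integration
2. Fix OCR Track image paths
3. Add cell_boxes coordinate validation
4. Test and deploy
### Step 2: Service Refactoring (independently deployable)
1. Extract ProcessingOrchestrator
2. Extract TableRenderer and FontManager
3. Update OCRService to use the new components
4. Test and deploy
### Step 3: Memory Management (independently deployable)
1. Implement MemoryPolicyEngine
2. Gradually migrate services to the new engine
3. Remove the old components
4. Test and deploy
### Step 4: Frontend Improvements (independently deployable)
1. Add TaskStore
2. Migrate ProcessingPage
3. Migrate TaskDetailPage
4. Merge type definitions
5. Test and deploy
### Rollback Plan
- Each Step is deployed independently, so a problem can be rolled back to the previous stable version
- Bug fixes come first, to make sure basic functionality is correct
- The refactoring does not change external behavior, so the impact of a rollback is minimal
## Open Questions
1. **Version compatibility of PyMuPDF find_tables()**: need to confirm whether the PyMuPDF version currently in use supports this API
2. **Scope of frontend state persistence**: should all tasks be persisted, or only the current session?
3. **Memory threshold tuning**: have the existing thresholds been validated in production, and can they be reused as-is?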


@@ -0,0 +1,68 @@
# Change: Refactor Dual-Track Architecture
## Why
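The dual-track OCR system currently has several known issues and architectural debt:
1. **Direct Track table issue**: `_detect_tables_by_position()` cannot identify merged cells, so edit3.pdf yields 204 incorrectly split cells (it should be 83)
2. **OCR Track image paths lost**: the `saved_path` of visual elements such as CHART/DIAGRAM is lost during conversion, so images are not placed back into the PDF
3. **OCR Track cell_boxes coordinates are scrambled**: the cell_boxes returned by PP-StructureV3 fall outside the page boundaries
4. **Overly complex service layer**: OCRService (2,326 lines) takes on too many responsibilities and is hard to maintain and test
5. **Oversized PDF generator**: PDFGeneratorService (4,644 lines) is a monolithic service that is hard to extend
## What Changes
### Phase 1: Fix known bugs (priority: highest)
- **Direct Track table fix**: use the PyMuPDF `find_tables()` API in place of `_detect_tables_by_position()`
- **OCR Track image path fix**: extend `_convert_pp3_element` to handle all visual element types (IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP)
- **Cell boxes coordinate validation**: add bounds checking and fall back to CV line detection when coordinates are out of range
- **Filter tiny decoration images**: filter out images < 200 px²
- **Remove covering images**: at render time, filter out images that overlap covering_images
### Phase 2: Service layer refactoring (priority: high)
- **Split OCRService**: extract a standalone `ProcessingOrchestrator` responsible for flow orchestration
- **Establish the Pipeline pattern**: use composition in place of the current aggregation
- **Extract TableRenderer**: pull the table rendering logic out of PDFGeneratorService
- **Extract FontManager**: pull the font management logic out of PDFGeneratorService
### Phase 3: Simplify memory management (priority: medium)
- **Unify the memory policy**: merge MemoryManager, MemoryGuard, and the various Semaphores into a single policy engine
- **Simplify configuration**: reduce the 8+ memory-related settings to 3-4 core ones
### Phase 4: Improve frontend state management (priority: medium)
- **Add TaskStore**: use Zustand to manage task state in place of scattered useState calls
- **Merge type definitions**: unify api.ts and apiV2.ts into a single type definition file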
## Impact
- Affected specs: `document-processing`
- Affected code:
  - `backend/app/services/direct_extraction_engine.py` (table detection)
  - `backend/app/services/ocr_to_unified_converter.py` (element conversion)
  - `backend/app/services/ocr_service.py` (service orchestration)
  - `backend/app/services/pdf_generator_service.py` (PDF generation)
  - `backend/app/services/memory_manager.py` (memory management)
  - `frontend/src/store/` (state management)
  - `frontend/src/types/` (type definitions)
## Risk Assessment
| Risk | Severity | Mitigation |
|------|----------|------------|
| Table rendering regression | | Use edit.pdf and edit3.pdf as regression tests |
| Memory management change leads to OOM | | Keep the existing thresholds and refactor the code structure only |
| Service refactoring causes processing failures | | Refactor incrementally, with full testing at each stage |
## Success Metrics
| Metric | Current | Target |
|--------|---------|--------|
| edit3.pdf Direct Track cells | 204 (incorrect) | 83 (correct) |
| OCR Track image placement rate | 0% | 100% |
| cell_boxes coordinate accuracy | ~40% | 100% |
| OCRService line count | 2,326 | < 800 |
| PDFGeneratorService line count | 4,644 | < 2,000 |


@@ -0,0 +1,151 @@
# document-processing Specification Delta
## ADDED Requirements
### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.
#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells
#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries
### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.
#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field
#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in content dictionary
- **AND** enable downstream PDF generation to embed these images
#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fallback to img_path if saved_path is missing
- **AND** log warning if both paths are missing
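The path-resolution scenarios above could look roughly like the following sketch. It is not the converter's actual code: `build_visual_content` is a hypothetical helper, and the `width`, `height`, and `format` keys on the input dictionary are assumed names; only the `saved_path` over `img_path` preference and the warning on a missing path come from the requirement.

```python
import logging

logger = logging.getLogger(__name__)

# Visual element types named in the requirement, lowercased for comparison.
VISUAL_TYPES = {"image", "figure", "chart", "diagram", "logo", "stamp"}


def build_visual_content(element_type: str, elem_data: dict):
    """Return a content dict for visual elements, or None for other types."""
    if element_type.lower() not in VISUAL_TYPES:
        return None
    # Prefer saved_path, then fall back to img_path.
    path = elem_data.get("saved_path") or elem_data.get("img_path") or ""
    if not path:
        logger.warning("Visual element %s has no saved_path or img_path", element_type)
    return {
        "path": path,
        "width": elem_data.get("width"),
        "height": elem_data.get("height"),
        "format": elem_data.get("format"),
    }


chart = build_visual_content("CHART", {"saved_path": "pp3_1_8.png", "img_path": "other.png"})
fallback = build_visual_content("figure", {"img_path": "other.png"})
```

Treating CHART like any other visual type here is what keeps its image path from being replaced by plain text content.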
### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.
#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing
#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as fallback
- **AND** reconstruct table structure from detected lines
- **AND** include fallback_used flag in table metadata
#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization
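A minimal sketch of the bounds check and clamping described in these scenarios is shown below. The function name `check_cell_boxes` is hypothetical and the real `validate_cell_boxes` in `ocr_to_unified_converter.py` may behave differently; the returned keys (`valid`, `invalid_count`, `clamped_boxes`) simply mirror what the Phase 1 test script reads. This version clamps to the page rather than to the table bbox and does not attempt the CV line detection fallback.

```python
def check_cell_boxes(cell_boxes, page_width, page_height):
    """Flag boxes that leave the page and return copies clamped to the page."""
    invalid_count = 0
    clamped_boxes = []
    for x0, y0, x1, y1 in cell_boxes:
        in_bounds = 0 <= x0 <= x1 <= page_width and 0 <= y0 <= y1 <= page_height
        if not in_bounds:
            invalid_count += 1
        # Clamp every coordinate into [0, page_width] x [0, page_height].
        clamped_boxes.append([
            min(max(x0, 0), page_width),
            min(max(y0, 0), page_height),
            min(max(x1, 0), page_width),
            min(max(y1, 0), page_height),
        ])
    return {
        "valid": invalid_count == 0,
        "invalid_count": invalid_count,
        "clamped_boxes": clamped_boxes,
    }


good = check_cell_boxes([[10, 10, 100, 50], [100, 10, 200, 50]], 300, 200)
bad = check_cell_boxes([[-10, 10, 100, 50], [10, 10, 400, 50], [10, 10, 100, 300]], 300, 200)
```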
### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.
#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width x height)
- **AND** filter out images with area < 200 square pixels
- **AND** log filtered image count for debugging
#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering
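The area filter and its configurable threshold might be sketched as follows. The helper `filter_tiny_images` and the `bbox`/`xref` dictionary layout are assumptions made for illustration; the 200 default and the "threshold = 0 disables filtering" rule come from the scenarios above.

```python
def filter_tiny_images(images, min_area=200.0):
    """Split images into (kept, dropped) by bbox area; min_area <= 0 disables filtering."""
    kept, dropped = [], []
    for img in images:
        x0, y0, x1, y1 = img["bbox"]
        area = (x1 - x0) * (y1 - y0)
        if min_area > 0 and area < min_area:
            dropped.append(img)
        else:
            kept.append(img)
    return kept, dropped


# Hypothetical images: a 10x10 decoration and a 50x40 content image.
images = [
    {"xref": 1, "bbox": (0, 0, 10, 10)},
    {"xref": 2, "bbox": (0, 0, 50, 40)},
]
kept, dropped = filter_tiny_images(images)
kept_all, dropped_none = filter_tiny_images(images, min_area=0)
```

Returning the dropped list as well makes it straightforward to log the filtered image count for debugging.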
### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.
#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion
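The IoU test mentioned in this scenario can be sketched as below, assuming axis-aligned boxes in `(x0, y0, x1, y1)` form. The function names are hypothetical and the engine's real detection may also consider fill color; only the 0.8 threshold is taken from the requirement.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are zero when the boxes are disjoint.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_covering(image_bbox, content_bbox, threshold=0.8):
    """Treat an image as covering when it overlaps the content almost exactly."""
    return iou(image_bbox, content_bbox) > threshold


same = iou((0, 0, 10, 10), (0, 0, 10, 10))
apart = iou((0, 0, 10, 10), (20, 20, 30, 30))
half = iou((0, 0, 10, 10), (0, 0, 10, 5))
```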
#### Scenario: Exclude covering images from rendering
- **WHEN** generating output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include covering_images_removed count in metadata
#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect both black fill (redaction style)
- **AND** white fill (whiteout style)
- **AND** low-contrast rectangles intended to hide content
## MODIFIED Requirements
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element
#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges
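
One way to derive rowspan/colspan from cell bounding boxes is sketched below. It is a pure-Python illustration that assumes the input is a list of `(x0, y0, x1, y1)` boxes such as the one exposed by `table.cells`, with `None` entries for covered placeholders; `compute_cell_spans` and the `tol` parameter are hypothetical and not part of PyMuPDF.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def compute_cell_spans(cell_bboxes: Sequence[Optional[BBox]],
                       tol: float = 1.0) -> List[Dict]:
    """Map each cell bbox to its grid position and row/column span."""

    def cluster(values):
        # Merge coordinates closer than `tol` into a single grid edge.
        edges = []
        for v in sorted(values):
            if not edges or v - edges[-1] > tol:
                edges.append(v)
        return edges

    def index_of(edges, v):
        # Index of the grid edge nearest to coordinate v.
        return min(range(len(edges)), key=lambda i: abs(edges[i] - v))

    # Placeholder cells covered by a merge are assumed to arrive as None.
    boxes = [b for b in cell_bboxes if b is not None]
    xs = cluster([x for b in boxes for x in (b[0], b[2])])
    ys = cluster([y for b in boxes for y in (b[1], b[3])])

    cells = []
    for x0, y0, x1, y1 in boxes:
        row, col = index_of(ys, y0), index_of(xs, x0)
        cells.append({
            "row": row,
            "col": col,
            "row_span": index_of(ys, y1) - row,
            "col_span": index_of(xs, x1) - col,
            "bbox": (x0, y0, x1, y1),
        })
    return cells
```

A cell whose box crosses two column edges gets `col_span == 2`, so a merged header is reported once instead of as several 1x1 cells.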
#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images
#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output

@@ -0,0 +1,108 @@
# Tasks: Refactor Dual-Track Architecture
## Phase 1: Fix Known Bugs (Completed)
### 1.1 Direct Track Table Fix (Completed ✓)
- [x] 1.1.1 Modify the `_process_native_table()` method to use `table.cells` to handle merged cells
- [x] 1.1.2 Use the PyMuPDF `page.find_tables()` API (already in use)
- [x] 1.1.3 Parse `table.cells` and compute `row_span`/`col_span` correctly
- [x] 1.1.4 Handle cells absorbed by a merge (skip `None` values, build a covered grid)
- [x] 1.1.5 Verify that edit3.pdf returns 83 correct cells ✓
### 1.2 OCR Track Image Path Fix (Completed ✓)
- [x] 1.2.1 Modify lines 604-613 of `ocr_to_unified_converter.py`
- [x] 1.2.2 Extend the visual element type check: `IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP`
- [x] 1.2.3 Prefer `saved_path`, falling back to `img_path`
- [x] 1.2.4 Ensure the content dict contains `saved_path`, `path`, `width`, `height`, `format`
- [x] 1.2.5 Code fixed (still needs verification by a full OCR Track test)
- [x] 1.2.6 Code fixed (still needs verification by a full OCR Track test)
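
The path handling described in 1.2.2 to 1.2.4 could look roughly like this. It is a sketch under assumptions: element types are plain strings rather than the project's `ElementType` enum, `build_element_content` is a hypothetical helper, and deriving `format` from the file extension is an illustrative choice the source does not specify.

```python
from pathlib import Path
from typing import Any, Dict, Union

# Visual element types listed in task 1.2.2 (as lowercase strings for this sketch).
VISUAL_TYPES = {"image", "figure", "chart", "diagram", "logo", "stamp"}


def build_element_content(element_type: str,
                          elem_data: Dict[str, Any]) -> Union[Dict[str, Any], str]:
    """Return an image content dict for visual elements, plain text otherwise."""
    if element_type.lower() in VISUAL_TYPES:
        # Prefer saved_path and fall back to img_path, as in task 1.2.3.
        path = elem_data.get("saved_path") or elem_data.get("img_path", "")
        return {
            "saved_path": path,
            "path": path,
            "width": elem_data.get("width", 0),
            "height": elem_data.get("height", 0),
            # Assumption: use an explicit format if present, else the file extension.
            "format": elem_data.get("format") or Path(path).suffix.lstrip(".").lower(),
        }
    return elem_data.get("content", "")
```

With this shape, a `CHART` element keeps its image path instead of falling through to the text branch.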
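### 1.3 Cell Boxes Coordinate Validation (Completed ✓)
- [x] 1.3.1 Add a `validate_cell_boxes()` function to `ocr_to_unified_converter.py`
- [x] 1.3.2 Check whether cell_boxes exceed the page bounds (0, 0, page_width, page_height)
- [x] 1.3.3 When out of bounds, use clamped coordinates and set needs_fallback
- [x] 1.3.4 Add logging for abnormal coordinates
- [x] 1.3.5 Unit tests confirm the coordinate validation logic is correct ✓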
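
The validation steps in 1.3.2 to 1.3.4 can be sketched as follows. The signature and return shape are assumptions made for illustration; the actual `validate_cell_boxes()` in `ocr_to_unified_converter.py` may differ.

```python
import logging
from typing import List, Sequence, Tuple

logger = logging.getLogger(__name__)

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def validate_cell_boxes(cell_boxes: Sequence[BBox],
                        page_width: float,
                        page_height: float) -> Tuple[List[BBox], bool]:
    """Clamp boxes to the page and report whether any box was out of bounds."""
    clamped: List[BBox] = []
    needs_fallback = False
    for box in cell_boxes:
        x0, y0, x1, y1 = box
        fixed = (
            min(max(x0, 0), page_width),
            min(max(y0, 0), page_height),
            min(max(x1, 0), page_width),
            min(max(y1, 0), page_height),
        )
        if fixed != (x0, y0, x1, y1):
            # Any clamping means the source coordinates are suspect.
            needs_fallback = True
            logger.warning("cell box %s outside page (%s x %s), clamped to %s",
                           box, page_width, page_height, fixed)
        clamped.append(fixed)
    return clamped, needs_fallback
```

Callers can use the returned flag to decide whether to trust the clamped boxes or switch to the fallback detection.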
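### 1.4 Filter Tiny Decorative Images (Completed ✓)
- [x] 1.4.1 Add an area check to the image extraction logic in `direct_extraction_engine.py`
- [x] 1.4.2 Filter out images with `image_area < min_image_area` (default 200 px²)
- [x] 1.4.3 Add a `min_image_area` configuration option so the threshold can be adjusted
- [x] 1.4.4 Verify that 3 tiny decorative images are detected in edit3.pdf ✓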
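### 1.5 Remove Covering Images (Completed ✓)
- [x] 1.5.1 Pass `covering_images` to the `_extract_images()` method
- [x] 1.5.2 Use an IoU threshold (0.8) and xref matching to identify covering images
- [x] 1.5.3 Exclude covering images from the final output
- [x] 1.5.4 Add a `_calculate_iou()` helper method
- [x] 1.5.5 Verify that 6 black-box covering images are detected in edit3.pdf ✓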
## Phase 2: Service Layer Refactoring
### 2.1 Extract ProcessingOrchestrator
- [ ] 2.1.1 Create `backend/app/services/processing_orchestrator.py`
- [ ] 2.1.2 Extract the workflow orchestration logic from OCRService
- [ ] 2.1.3 Define the `ProcessingPipeline` interface
- [ ] 2.1.4 Implement DirectPipeline and OCRPipeline
- [ ] 2.1.5 Update OCRService to use ProcessingOrchestrator
- [ ] 2.1.6 Ensure existing functionality is unaffected
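
A possible shape for the planned interface is sketched below. Everything beyond the class names `ProcessingPipeline`, `DirectPipeline`, `OCRPipeline` and `ProcessingOrchestrator` is an assumption: the method names, the `track` string keys and the dict results are placeholders, since these tasks are not yet implemented.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class ProcessingPipeline(ABC):
    """Hypothetical common interface for the two processing tracks."""

    track: str  # key used by the orchestrator to select this pipeline

    @abstractmethod
    def process(self, file_path: str) -> Dict[str, Any]:
        """Process one document and return a result description."""


class DirectPipeline(ProcessingPipeline):
    track = "direct"

    def process(self, file_path: str) -> Dict[str, Any]:
        # Placeholder: the real pipeline would call the direct extraction engine.
        return {"track": self.track, "file": file_path}


class OCRPipeline(ProcessingPipeline):
    track = "ocr"

    def process(self, file_path: str) -> Dict[str, Any]:
        # Placeholder: the real pipeline would run the OCR track.
        return {"track": self.track, "file": file_path}


class ProcessingOrchestrator:
    """Dispatches a document to the pipeline registered for the chosen track."""

    def __init__(self, pipelines: Iterable[ProcessingPipeline]) -> None:
        self._pipelines = {p.track: p for p in pipelines}

    def run(self, track: str, file_path: str) -> Dict[str, Any]:
        if track not in self._pipelines:
            raise ValueError(f"unknown track: {track!r}")
        return self._pipelines[track].process(file_path)
```

Keeping track selection outside the pipelines would let OCRService delegate to the orchestrator without knowing which engine runs.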
### 2.2 Extract TableRenderer
- [ ] 2.2.1 Create `backend/app/services/pdf_table_renderer.py`
- [ ] 2.2.2 Extract HTMLTableParser from PDFGeneratorService
- [ ] 2.2.3 Extract the table rendering logic into a standalone class
- [ ] 2.2.4 Support rendering of merged cells
- [ ] 2.2.5 Update PDFGeneratorService to use TableRenderer
### 2.3 Extract FontManager
- [ ] 2.3.1 Create `backend/app/services/pdf_font_manager.py`
- [ ] 2.3.2 Extract the font loading and caching logic
- [ ] 2.3.3 Extract the CJK font support logic
- [ ] 2.3.4 Implement a font fallback mechanism
- [ ] 2.3.5 Update PDFGeneratorService to use FontManager
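## Phase 3: Memory Management Simplification
### 3.1 Unified Memory Policy Engine
- [ ] 3.1.1 Create `backend/app/services/memory_policy_engine.py`
- [ ] 3.1.2 Define a unified memory policy interface
- [ ] 3.1.3 Merge the MemoryManager and MemoryGuard logic
- [ ] 3.1.4 Integrate Semaphore management
- [ ] 3.1.5 Reduce configuration to 3-4 core settings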
### 3.2 Update Services to Use the New Memory Engine
- [ ] 3.2.1 Update OCRService to use MemoryPolicyEngine
- [ ] 3.2.2 Update ServicePool to use MemoryPolicyEngine
- [ ] 3.2.3 Remove the old MemoryGuard references
- [ ] 3.2.4 Verify that GPU memory monitoring works correctly
## Phase 4: Frontend State Management Improvements
### 4.1 Add TaskStore
- [ ] 4.1.1 Create `frontend/src/store/taskStore.ts`
- [ ] 4.1.2 Define the task state structure (currentTask, tasks, processingStatus)
- [ ] 4.1.3 Implement CRUD operations and state transitions
- [ ] 4.1.4 Add localStorage persistence
- [ ] 4.1.5 Update ProcessingPage to use TaskStore
- [ ] 4.1.6 Update TaskDetailPage to use TaskStore
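### 4.2 Merge Type Definitions
- [ ] 4.2.1 Review the differences between `api.ts` and `apiV2.ts`
- [ ] 4.2.2 Merge the type definitions into `apiV2.ts`
- [ ] 4.2.3 Remove the duplicate definitions from `api.ts`
- [ ] 4.2.4 Update all import paths
- [ ] 4.2.5 Verify that TypeScript compiles without errors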
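## Phase 5: Testing and Verification
### 5.1 Regression Tests
- [ ] 5.1.1 Test Direct Track with edit.pdf (ensure no regression)
- [ ] 5.1.2 Test Direct Track table merging with edit3.pdf
- [ ] 5.1.3 Test OCR Track image re-insertion with edit.pdf
- [ ] 5.1.4 Test OCR Track image re-insertion with edit3.pdf
- [ ] 5.1.5 Verify that all cell_boxes coordinates are correct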
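### 5.2 Performance Tests
- [ ] 5.2.1 Measure processing time after the refactor
- [ ] 5.2.2 Verify that memory usage has not increased noticeably
- [ ] 5.2.3 Verify that GPU utilization is normal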