test

PLAN.md (new file, +186 lines)
@@ -0,0 +1,186 @@
# PDF Processing Dual-Track Improvement Plan (Revision v5)

## Problem Analysis

### 1. Direct Track table issues

| Metric | edit.pdf | edit3.pdf |
|------|----------|-----------|
| Original table structure | 6 rows x 2 cols | 12 rows x 17 cols |
| Cells identified by PyMuPDF | 12 (no merges) | **83** (121 positions merged) |
| Cells extracted by Direct Track | 12 | **204** (all treated as 1x1) |
| Colspan/rowspan detection | Not needed | **❌ Not detected at all** |
| Rendering result | ✓ Perfect | ❌ Columns split incorrectly, text overflows |

**Root cause**: `_detect_tables_by_position()` cannot detect merged cells.

### 2. Direct Track image issues (edit3.pdf)

| Issue | Count | Notes |
|------|------|------|
| Tiny decorative images | 3 | < 200 px², should be filtered |
| Covering images (black boxes) | 6 | Detected but not removed from rendering |
| Large vector_graphics | 3 | ✓ Correctly filtered |

### 3. OCR Track table issues

| Table | cells | cell_boxes | cell_boxes coordinate check |
|------|-------|------------|-------------------|
| pp3_0_3 | 13 | 13 | ⚠️ 1/5 out of range |
| pp3_0_6 | 29 | 12 | ❌ All out of range |
| pp3_0_7 | 12 | 51 | ❌ All out of range |
| pp3_0_16 | 51 | 29 | ❌ All out of range |

**Root cause**: PP-StructureV3 returns cell_boxes in an inconsistent coordinate system.

### 4. OCR Track image issues ❌ Severe

| File | Image element | PP-Structure raw data | Converted UnifiedDocument | Result |
|------|---------|---------------------|----------------------|------|
| edit.pdf | pp3_1_8 | saved_path="pp3_1_8.png" ✓ | content=string ❌ | Image not placed back |
| edit3.pdf | pp3_1_2 | saved_path="pp3_1_2.png" ✓ | content=string ❌ | Image not placed back |

**Root cause**: in the `_convert_pp3_element` method of `ocr_to_unified_converter.py`:

```python
# Current code (lines 604-613)
elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
    content = {'path': elem_data.get('img_path', ''), ...}
else:
    content = elem_data.get('content', '')  # ← CHART types end up here!
```

**Problems**:
1. The `CHART` type is not treated as a visual element.
2. `saved_path` is lost entirely.
3. `content` becomes text instead of an image path.

---

## Improvement Plan

### Phase 1: Switch Direct Track to PyMuPDF find_tables (priority: highest)

**Problem**: `_detect_tables_by_position` cannot detect merged cells.

**Approach**: use PyMuPDF's `find_tables()` API instead.

**File**: `backend/app/services/direct_extraction_engine.py`

```python
def _extract_tables_with_pymupdf(self, page, page_num, counter):
    tables = page.find_tables()
    for table in tables.tables:
        # Collect cells, preserving merge information
        cells = []
        for row_idx in range(table.row_count):
            for col_idx in range(table.col_count):
                cell_data = table.cells[row_idx * table.col_count + col_idx]
                if cell_data is None:
                    continue  # skip positions absorbed by a merge
                # Compute row_span/col_span...
```

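The Phase 1 snippet above trails off at computing row_span/col_span. As a standalone sketch (toy data, not the repository's implementation), spans can be derived from a PyMuPDF-style flat cells list in which positions absorbed by a merge are None; note that the simple rightward/downward scan shown here can over-extend a span when a neighbouring merge also leaves None gaps:

```python
def compute_spans(cells, row_count, col_count):
    """Map (row, col) of each anchor cell to its (row_span, col_span).

    `cells` is a flat row-major list; positions covered by a merge are None.
    Simplified heuristic: scan right for col_span and down for row_span.
    """
    spans = {}
    for r in range(row_count):
        for c in range(col_count):
            if cells[r * col_count + c] is None:
                continue  # covered by a merge anchored elsewhere
            col_span = 1
            while (c + col_span < col_count
                   and cells[r * col_count + c + col_span] is None):
                col_span += 1
            row_span = 1
            while (r + row_span < row_count
                   and cells[(r + row_span) * col_count + c] is None):
                row_span += 1
            spans[(r, c)] = (row_span, col_span)
    return spans


# 2x3 grid where 'A' spans two columns, so position (0, 1) is None
cells = ['A', None, 'B',
         'C', 'D', 'E']
print(compute_spans(cells, 2, 3))
```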
### Phase 2: Fix OCR Track image path loss (priority: highest)

**Problem**: the saved_path of CHART elements is lost during conversion.

**File**: `backend/app/services/ocr_to_unified_converter.py`
**Location**: `_convert_pp3_element` method, around line 604

**Change**:

```python
# Before
elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:

# After: cover all visual element types
elif element_type in [
    ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
    ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
]:
    # Prefer saved_path
    image_path = (
        elem_data.get('saved_path') or
        elem_data.get('img_path') or
        ''
    )
    content = {
        'saved_path': image_path,  # key point: preserve saved_path
        'path': image_path,
        'width': elem_data.get('width', 0),
        'height': elem_data.get('height', 0),
        'format': elem_data.get('format', 'unknown')
    }
```

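The core of the Phase 2 fix is the saved_path-over-img_path fallback chain. A minimal standalone illustration (toy helper, not the converter itself):

```python
def visual_content(elem_data):
    """Build a visual-element content dict, preferring saved_path over img_path."""
    image_path = elem_data.get('saved_path') or elem_data.get('img_path') or ''
    return {
        'saved_path': image_path,  # preserved so the image can be placed back
        'path': image_path,        # kept for backward compatibility
        'width': elem_data.get('width', 0),
        'height': elem_data.get('height', 0),
    }

chart = {'saved_path': 'pp3_1_2.png', 'width': 320, 'height': 240}
print(visual_content(chart)['saved_path'])  # pp3_1_2.png

legacy = {'img_path': '/tmp/raw.png'}
print(visual_content(legacy)['path'])       # /tmp/raw.png
```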
### Phase 3: Fix OCR Track cell_boxes coordinates (priority: high)

**Approach**: validate coordinates; when they fall out of range, fall back to CV line detection.

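The validation half of Phase 3 boils down to a bounds check plus clamping. A simplified standalone sketch (not the repository's `validate_cell_boxes`):

```python
def clamp_box(box, page_w, page_h, tol=5.0):
    """Return (is_valid, clamped_box) for one [x0, y0, x1, y1] cell box."""
    x0, y0, x1, y1 = box
    # Valid when ordered and within the page, allowing a small tolerance
    valid = (-tol <= x0 <= x1 <= page_w + tol) and (-tol <= y0 <= y1 <= page_h + tol)
    clamped = [
        max(0.0, min(x0, page_w)), max(0.0, min(y0, page_h)),
        max(0.0, min(x1, page_w)), max(0.0, min(y1, page_h)),
    ]
    return valid, clamped

print(clamp_box([10, 10, 50, 30], 595, 842))    # in range, kept as-is
print(clamp_box([-40, 10, 900, 30], 595, 842))  # flagged invalid, clamped to page
```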
### Phase 4: Filter tiny decorative images (priority: high)

```python
if elem_area < 200:
    continue  # skip images smaller than 200 px²
```

### Phase 5: Filter covering images (priority: high)

During extraction, filter out images that overlap entries in covering_images.

---

## Implementation Priorities

| Phase | Description | Priority | Impact |
|------|------|--------|------|
| 1 | Direct Track uses PyMuPDF find_tables | **Highest** | Fixes merged cells |
| 2 | **OCR Track image path fix** | **Highest** | Fixes images not placed back |
| 3 | OCR Track cell_boxes coordinate fix | High | Fixes scrambled table rendering |
| 4 | Filter tiny decorative images | High | Fewer meaningless images |
| 5 | Filter covering images | High | Fewer black boxes |

---

## Expected Results

### Direct Track

| Metric | Before | After |
|------|--------|--------|
| edit3.pdf cells | 204 (wrongly split) | 83 (merges correctly detected) |
| Colspan/rowspan detection | ❌ | ✓ |

### OCR Track images

| Metric | Before | After |
|------|--------|--------|
| pp3_1_8 (edit.pdf) | Image not placed back | ✓ Placed back correctly |
| pp3_1_2 (edit3.pdf) | Image not placed back | ✓ Placed back correctly |

### OCR Track tables

| Metric | Before | After |
|------|--------|--------|
| cell_boxes coordinates | 3/5 tables incorrect | All correct, or CV fallback |

---

## Test Plan

1. **edit.pdf Direct Track**: ensure no regression

2. **edit3.pdf Direct Track**:
   - Verify the table is detected with 83 cells (not 204)
   - Verify colspan/rowspan are correct
   - Verify tiny images are filtered
   - Verify black boxes are filtered

3. **edit.pdf OCR Track**:
   - **Verify pp3_1_8.png is placed back correctly**
   - Verify the cell_boxes coordinate fix

4. **edit3.pdf OCR Track**:
   - **Verify pp3_1_2.png is placed back correctly**
   - Verify the cell_boxes coordinate fix
File diff suppressed because it is too large
@@ -178,6 +178,114 @@ def trim_empty_columns(table_dict: Dict[str, Any]) -> Dict[str, Any]:
     return result


+def validate_cell_boxes(
+    cell_boxes: List[List[float]],
+    table_bbox: List[float],
+    page_width: float,
+    page_height: float,
+    tolerance: float = 5.0
+) -> Dict[str, Any]:
+    """
+    Validate cell_boxes coordinates against page boundaries and table bbox.
+
+    PP-StructureV3 sometimes returns cell_boxes with coordinates that exceed
+    page boundaries. This function validates and reports issues.
+
+    Args:
+        cell_boxes: List of cell bounding boxes [[x0, y0, x1, y1], ...]
+        table_bbox: Table bounding box [x0, y0, x1, y1]
+        page_width: Page width in pixels
+        page_height: Page height in pixels
+        tolerance: Allowed tolerance for boundary checks (pixels)
+
+    Returns:
+        Dict with:
+        - valid: bool - whether all cell_boxes are valid
+        - invalid_count: int - number of invalid cell_boxes
+        - clamped_boxes: List - cell_boxes clamped to valid boundaries
+        - issues: List[str] - description of issues found
+    """
+    if not cell_boxes:
+        return {'valid': True, 'invalid_count': 0, 'clamped_boxes': [], 'issues': []}
+
+    issues = []
+    invalid_count = 0
+    clamped_boxes = []
+
+    # Page boundaries with tolerance
+    min_x = -tolerance
+    min_y = -tolerance
+    max_x = page_width + tolerance
+    max_y = page_height + tolerance
+
+    for idx, box in enumerate(cell_boxes):
+        if not box or len(box) < 4:
+            issues.append(f"Cell {idx}: Invalid box format")
+            invalid_count += 1
+            clamped_boxes.append([0, 0, 0, 0])
+            continue
+
+        x0, y0, x1, y1 = box[:4]
+        is_valid = True
+        cell_issues = []
+
+        # Check if coordinates exceed page boundaries
+        if x0 < min_x:
+            cell_issues.append(f"x0={x0:.1f} < 0")
+            is_valid = False
+        if y0 < min_y:
+            cell_issues.append(f"y0={y0:.1f} < 0")
+            is_valid = False
+        if x1 > max_x:
+            cell_issues.append(f"x1={x1:.1f} > page_width={page_width:.1f}")
+            is_valid = False
+        if y1 > max_y:
+            cell_issues.append(f"y1={y1:.1f} > page_height={page_height:.1f}")
+            is_valid = False
+
+        # Check for inverted coordinates
+        if x0 > x1:
+            cell_issues.append(f"x0={x0:.1f} > x1={x1:.1f}")
+            is_valid = False
+        if y0 > y1:
+            cell_issues.append(f"y0={y0:.1f} > y1={y1:.1f}")
+            is_valid = False
+
+        if not is_valid:
+            invalid_count += 1
+            issues.append(f"Cell {idx}: {', '.join(cell_issues)}")
+
+        # Clamp to valid boundaries
+        clamped_box = [
+            max(0, min(x0, page_width)),
+            max(0, min(y0, page_height)),
+            max(0, min(x1, page_width)),
+            max(0, min(y1, page_height))
+        ]
+
+        # Ensure proper ordering after clamping
+        if clamped_box[0] > clamped_box[2]:
+            clamped_box[0], clamped_box[2] = clamped_box[2], clamped_box[0]
+        if clamped_box[1] > clamped_box[3]:
+            clamped_box[1], clamped_box[3] = clamped_box[3], clamped_box[1]
+
+        clamped_boxes.append(clamped_box)
+
+    if invalid_count > 0:
+        logger.warning(
+            f"Cell boxes validation: {invalid_count}/{len(cell_boxes)} invalid. "
+            f"Page: {page_width:.0f}x{page_height:.0f}, Table bbox: {table_bbox}"
+        )
+
+    return {
+        'valid': invalid_count == 0,
+        'invalid_count': invalid_count,
+        'clamped_boxes': clamped_boxes,
+        'issues': issues,
+        'needs_fallback': invalid_count > len(cell_boxes) * 0.5  # >50% invalid = needs fallback
+    }
+
+
 class OCRToUnifiedConverter:
     """
     Converter for transforming PP-StructureV3 OCR results to UnifiedDocument format.
@@ -337,19 +445,22 @@ class OCRToUnifiedConverter:
         for page_idx, page_result in enumerate(enhanced_results):
             elements = []

+            # Get page dimensions first (needed for element conversion)
+            page_width = page_result.get('width', 0)
+            page_height = page_result.get('height', 0)
+            pp_dimensions = Dimensions(width=page_width, height=page_height)
+
             # Process elements from parsing_res_list
             if 'elements' in page_result:
                 for elem_data in page_result['elements']:
-                    element = self._convert_pp3_element(elem_data, page_idx)
+                    element = self._convert_pp3_element(
+                        elem_data, page_idx,
+                        page_width=page_width,
+                        page_height=page_height
+                    )
                     if element:
                         elements.append(element)

-            # Get page dimensions
-            pp_dimensions = Dimensions(
-                width=page_result.get('width', 0),
-                height=page_result.get('height', 0)
-            )
-
             # Apply gap filling if enabled and raw regions available
             if self.gap_filling_service and raw_text_regions:
                 # Filter raw regions for current page
@@ -556,9 +667,19 @@ class OCRToUnifiedConverter:
     def _convert_pp3_element(
         self,
         elem_data: Dict[str, Any],
-        page_idx: int
+        page_idx: int,
+        page_width: float = 0,
+        page_height: float = 0
     ) -> Optional[DocumentElement]:
-        """Convert PP-StructureV3 element to DocumentElement."""
+        """
+        Convert PP-StructureV3 element to DocumentElement.
+
+        Args:
+            elem_data: Element data from PP-StructureV3
+            page_idx: Page index (0-based)
+            page_width: Page width for coordinate validation
+            page_height: Page height for coordinate validation
+        """
         try:
             # Extract bbox
             bbox_data = elem_data.get('bbox', [0, 0, 0, 0])
@@ -597,18 +718,67 @@ class OCRToUnifiedConverter:
             # Preserve cell_boxes and embedded_images in metadata for PDF generation
             # These are extracted by PP-StructureV3 and provide accurate cell positioning
             if 'cell_boxes' in elem_data:
-                elem_data.setdefault('metadata', {})['cell_boxes'] = elem_data['cell_boxes']
-                elem_data['metadata']['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
+                cell_boxes = elem_data['cell_boxes']
+                elem_data.setdefault('metadata', {})['cell_boxes_source'] = elem_data.get('cell_boxes_source', 'table_res_list')
+
+                # Validate cell_boxes coordinates if page dimensions are available
+                if page_width > 0 and page_height > 0:
+                    validation = validate_cell_boxes(
+                        cell_boxes=cell_boxes,
+                        table_bbox=bbox_data,
+                        page_width=page_width,
+                        page_height=page_height
+                    )
+
+                    if not validation['valid']:
+                        elem_data['metadata']['cell_boxes_validation'] = {
+                            'valid': False,
+                            'invalid_count': validation['invalid_count'],
+                            'total_count': len(cell_boxes),
+                            'needs_fallback': validation['needs_fallback']
+                        }
+                        # Use clamped boxes instead of invalid ones
+                        elem_data['metadata']['cell_boxes'] = validation['clamped_boxes']
+                        elem_data['metadata']['cell_boxes_original'] = cell_boxes
+
+                        if validation['needs_fallback']:
+                            logger.warning(
+                                f"Table {elem_data.get('element_id')}: "
+                                f"{validation['invalid_count']}/{len(cell_boxes)} cell_boxes invalid, "
+                                f"fallback recommended"
+                            )
+                    else:
+                        elem_data['metadata']['cell_boxes'] = cell_boxes
+                        elem_data['metadata']['cell_boxes_validation'] = {'valid': True}
+                else:
+                    # No page dimensions available, store as-is
+                    elem_data['metadata']['cell_boxes'] = cell_boxes

             if 'embedded_images' in elem_data:
                 elem_data.setdefault('metadata', {})['embedded_images'] = elem_data['embedded_images']
-            elif element_type in [ElementType.IMAGE, ElementType.FIGURE]:
-                # For images, use metadata dict as content
+            elif element_type in [
+                ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART,
+                ElementType.DIAGRAM, ElementType.LOGO, ElementType.STAMP
+            ]:
+                # For all visual elements, use metadata dict as content
+                # Priority: saved_path > img_path (PP-StructureV3 uses saved_path)
+                image_path = (
+                    elem_data.get('saved_path') or
+                    elem_data.get('img_path') or
+                    ''
+                )
                 content = {
-                    'path': elem_data.get('img_path', ''),
+                    'saved_path': image_path,  # Preserve original path key
+                    'path': image_path,  # For backward compatibility
                     'width': elem_data.get('width', 0),
                     'height': elem_data.get('height', 0),
                     'format': elem_data.get('format', 'unknown')
                 }
+                if not image_path:
+                    logger.warning(
+                        f"Visual element {element_type.value} missing image path: "
+                        f"saved_path={elem_data.get('saved_path')}, img_path={elem_data.get('img_path')}"
+                    )
             else:
                 content = elem_data.get('content', '')

@@ -1139,10 +1309,18 @@ class OCRToUnifiedConverter:
         for page_idx, page_data in enumerate(pages_data):
             elements = []

+            # Get page dimensions first
+            page_width = page_data.get('width', 0)
+            page_height = page_data.get('height', 0)
+
             # Process each element in the page
             if 'elements' in page_data:
                 for elem_data in page_data['elements']:
-                    element = self._convert_pp3_element(elem_data, page_idx)
+                    element = self._convert_pp3_element(
+                        elem_data, page_idx,
+                        page_width=page_width,
+                        page_height=page_height
+                    )
                     if element:
                         elements.append(element)

@@ -1150,8 +1328,8 @@ class OCRToUnifiedConverter:
             page = Page(
                 page_number=page_idx + 1,
                 dimensions=Dimensions(
-                    width=page_data.get('width', 0),
-                    height=page_data.get('height', 0)
+                    width=page_width,
+                    height=page_height
                 ),
                 elements=elements,
                 metadata={'reading_order': self._calculate_reading_order(elements)}

@@ -3371,18 +3371,21 @@ class PDFGeneratorService:
             "rows": 6,
             "cols": 2,
             "cells": [
-                {"row": 0, "col": 0, "content": "..."},
+                {"row": 0, "col": 0, "content": "...", "row_span": 1, "col_span": 2},
                 {"row": 0, "col": 1, "content": "..."},
                 ...
             ]
         }

-        Returns format compatible with HTMLTableParser output:
+        Returns format compatible with HTMLTableParser output (with colspan/rowspan/col):
         [
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 0
-            {"cells": [{"text": "..."}, {"text": "..."}]},  # row 1
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
+            {"cells": [{"text": "...", "colspan": 1, "rowspan": 1, "col": 0}, ...]},
             ...
         ]
+
+        Note: This returns actual cells per row with their absolute column positions.
+        The table renderer uses 'col' to place cells correctly in the grid.
         """
         try:
             num_rows = content.get('rows', 0)
@@ -3392,21 +3395,39 @@ class PDFGeneratorService:
             if not cells or num_rows == 0 or num_cols == 0:
                 return []

-            # Initialize rows structure
-            rows_data = []
-            for _ in range(num_rows):
-                rows_data.append({'cells': [{'text': ''} for _ in range(num_cols)]})
-
-            # Fill in cell content
+            # Group cells by row
+            cells_by_row = {}
             for cell in cells:
                 row_idx = cell.get('row', 0)
-                col_idx = cell.get('col', 0)
-                cell_content = cell.get('content', '')
-
-                if 0 <= row_idx < num_rows and 0 <= col_idx < num_cols:
-                    rows_data[row_idx]['cells'][col_idx]['text'] = str(cell_content) if cell_content else ''
+                if row_idx not in cells_by_row:
+                    cells_by_row[row_idx] = []
+                cells_by_row[row_idx].append(cell)

-            logger.debug(f"Built {num_rows} rows from cells dict")
+            # Sort cells within each row by column
+            for row_idx in cells_by_row:
+                cells_by_row[row_idx].sort(key=lambda c: c.get('col', 0))
+
+            # Build rows structure with colspan/rowspan info and absolute col position
+            rows_data = []
+            for row_idx in range(num_rows):
+                row_cells = []
+                if row_idx in cells_by_row:
+                    for cell in cells_by_row[row_idx]:
+                        cell_content = cell.get('content', '')
+                        row_span = cell.get('row_span', 1) or 1
+                        col_span = cell.get('col_span', 1) or 1
+                        col_idx = cell.get('col', 0)
+
+                        row_cells.append({
+                            'text': str(cell_content) if cell_content else '',
+                            'rowspan': row_span,
+                            'colspan': col_span,
+                            'col': col_idx  # Absolute column position
+                        })
+
+                rows_data.append({'cells': row_cells})
+
+            logger.debug(f"Built {num_rows} rows from cells dict with span info")
             return rows_data

         except Exception as e:
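The "group cells by row, then sort by column" step above, shown in isolation on toy data (standalone sketch, not the repository's method):

```python
from collections import defaultdict

cells = [
    {'row': 1, 'col': 1, 'content': 'd'},
    {'row': 0, 'col': 1, 'content': 'b'},
    {'row': 0, 'col': 0, 'content': 'a'},
    {'row': 1, 'col': 0, 'content': 'c'},
]

# Group by row index, then order each row's cells by column
cells_by_row = defaultdict(list)
for cell in cells:
    cells_by_row[cell.get('row', 0)].append(cell)
for row_cells in cells_by_row.values():
    row_cells.sort(key=lambda c: c.get('col', 0))

print([[c['content'] for c in cells_by_row[r]] for r in sorted(cells_by_row)])
# [['a', 'b'], ['c', 'd']]
```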
@@ -3471,19 +3492,115 @@ class PDFGeneratorService:
        table_width = bbox.x1 - bbox.x0
        table_height = bbox.y1 - bbox.y0

-        # Build table data for ReportLab
-        table_content = []
-        for row in rows:
-            row_data = [cell['text'].strip() for cell in row['cells']]
-            table_content.append(row_data)
-
        # Create table
        from reportlab.platypus import Table, TableStyle
        from reportlab.lib import colors

-        # Determine number of rows and columns for cell_boxes calculation
+        # Determine grid size from rows structure
+        # Note: rows may have 'col' attribute for absolute positioning (from Direct extraction)
+        # or may be sequential (from HTML parsing)
        num_rows = len(rows)
-        max_cols = max(len(row['cells']) for row in rows) if rows else 0
+
+        # Check if cells have absolute column positions
+        has_absolute_cols = any(
+            'col' in cell
+            for row in rows
+            for cell in row['cells']
+        )
+
+        # Calculate actual number of columns
+        max_cols = 0
+        if has_absolute_cols:
+            # Use absolute col positions + colspan to find max column
+            for row in rows:
+                for cell in row['cells']:
+                    col = cell.get('col', 0)
+                    colspan = cell.get('colspan', 1)
+                    max_cols = max(max_cols, col + colspan)
+        else:
+            # Sequential cells: sum up colspans
+            for row in rows:
+                col_pos = 0
+                for cell in row['cells']:
+                    colspan = cell.get('colspan', 1)
+                    col_pos += colspan
+                max_cols = max(max_cols, col_pos)
+
+        # Build table data for ReportLab with proper grid structure
+        # ReportLab needs a full grid with placeholders for spanned cells
+        # and SPAN commands to merge them
+        table_content = []
+        span_commands = []
+        covered = set()  # Track cells covered by spans
+
+        # First pass: mark covered cells and collect SPAN commands
+        for row_idx, row in enumerate(rows):
+            if has_absolute_cols:
+                # Use absolute column positions
+                for cell in row['cells']:
+                    col_pos = cell.get('col', 0)
+                    colspan = cell.get('colspan', 1)
+                    rowspan = cell.get('rowspan', 1)
+
+                    # Mark cells covered by this span
+                    if colspan > 1 or rowspan > 1:
+                        for r in range(row_idx, row_idx + rowspan):
+                            for c in range(col_pos, col_pos + colspan):
+                                if (r, c) != (row_idx, col_pos):
+                                    covered.add((r, c))
+                        # Add SPAN command for ReportLab
+                        span_commands.append((
+                            'SPAN',
+                            (col_pos, row_idx),
+                            (col_pos + colspan - 1, row_idx + rowspan - 1)
+                        ))
+            else:
+                # Sequential positioning
+                col_pos = 0
+                for cell in row['cells']:
+                    while (row_idx, col_pos) in covered:
+                        col_pos += 1
+
+                    colspan = cell.get('colspan', 1)
+                    rowspan = cell.get('rowspan', 1)
+
+                    if colspan > 1 or rowspan > 1:
+                        for r in range(row_idx, row_idx + rowspan):
+                            for c in range(col_pos, col_pos + colspan):
+                                if (r, c) != (row_idx, col_pos):
+                                    covered.add((r, c))
+                        span_commands.append((
+                            'SPAN',
+                            (col_pos, row_idx),
+                            (col_pos + colspan - 1, row_idx + rowspan - 1)
+                        ))
+                    col_pos += colspan
+
+        # Second pass: build content grid
+        for row_idx in range(num_rows):
+            row_data = [''] * max_cols
+
+            if row_idx < len(rows):
+                if has_absolute_cols:
+                    # Place cells at their absolute positions
+                    for cell in rows[row_idx]['cells']:
+                        col_pos = cell.get('col', 0)
+                        if col_pos < max_cols:
+                            row_data[col_pos] = cell['text'].strip()
+                else:
+                    # Sequential placement
+                    col_pos = 0
+                    for cell in rows[row_idx]['cells']:
+                        while col_pos < max_cols and (row_idx, col_pos) in covered:
+                            col_pos += 1
+                        if col_pos < max_cols:
+                            row_data[col_pos] = cell['text'].strip()
+                        colspan = cell.get('colspan', 1)
+                        col_pos += colspan
+
+            table_content.append(row_data)
+
+        logger.debug(f"Built table grid: {num_rows} rows × {max_cols} cols, {len(span_commands)} span commands (absolute_cols={has_absolute_cols})")

        # Use original column widths from extraction if available
        # Otherwise try to compute from cell_boxes (from PP-StructureV3)
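The two-pass grid build in the hunk above can be illustrated ReportLab-free: cells carry an absolute 'col' plus colspan/rowspan, and we emit a full grid with placeholders alongside ReportLab-style SPAN commands. A standalone toy (not the repository's code):

```python
rows = [
    {'cells': [{'text': 'Header', 'col': 0, 'colspan': 2, 'rowspan': 1}]},
    {'cells': [{'text': 'a', 'col': 0, 'colspan': 1, 'rowspan': 1},
               {'text': 'b', 'col': 1, 'colspan': 1, 'rowspan': 1}]},
]

# Grid width is the furthest column any cell reaches
max_cols = max(c['col'] + c['colspan'] for r in rows for c in r['cells'])

span_commands = []
grid = []
for row_idx, row in enumerate(rows):
    row_data = [''] * max_cols  # placeholders for spanned positions
    for cell in row['cells']:
        row_data[cell['col']] = cell['text']
        if cell['colspan'] > 1 or cell['rowspan'] > 1:
            # ReportLab-style SPAN: ('SPAN', (startcol, startrow), (endcol, endrow))
            span_commands.append((
                'SPAN',
                (cell['col'], row_idx),
                (cell['col'] + cell['colspan'] - 1, row_idx + cell['rowspan'] - 1),
            ))
    grid.append(row_data)

print(grid)           # [['Header', ''], ['a', 'b']]
print(span_commands)  # [('SPAN', (0, 0), (1, 0))]
```

The resulting `span_commands` list is exactly the shape `TableStyle` accepts alongside GRID/FONT commands.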
@@ -3517,7 +3634,7 @@ class PDFGeneratorService:
            # Apply style with minimal padding to reduce table extension
            # Use Chinese font to support special characters (℃, μm, ≦, ×, Ω, etc.)
            font_for_table = self.font_name if self.font_registered else 'Helvetica'
-            style = TableStyle([
+            style_commands = [
                ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
                ('FONTNAME', (0, 0), (-1, -1), font_for_table),
                ('FONTSIZE', (0, 0), (-1, -1), 8),
@@ -3529,7 +3646,13 @@ class PDFGeneratorService:
                ('BOTTOMPADDING', (0, 0), (-1, -1), 0),
                ('LEFTPADDING', (0, 0), (-1, -1), 1),
                ('RIGHTPADDING', (0, 0), (-1, -1), 1),
-            ])
+            ]
+            # Add span commands for merged cells
+            style_commands.extend(span_commands)
+            if span_commands:
+                logger.info(f"Applied {len(span_commands)} SPAN commands for merged cells")
+
+            style = TableStyle(style_commands)
            t.setStyle(style)

            # Use canvas scaling as fallback to fit table within bbox
@@ -4350,33 +4473,100 @@ class PDFGeneratorService:
            # Replace newlines with <br/>
            safe_content = safe_content.replace('\n', '<br/>')

-            # Calculate font size from bbox height, but keep minimum 10pt
-            font_size = max(box_height * 0.7, 10)
-            font_size = min(font_size, 24)  # Cap at 24pt
+            # Get original font size from style info
+            style_info = elem.get('style', {})
+            original_font_size = style_info.get('font_size', 12.0)

-            # Create style for this element
-            elem_style = ParagraphStyle(
-                f'elem_{id(elem)}',
-                parent=base_style,
-                fontSize=font_size,
-                leading=font_size * 1.2,
-            )
+            # Detect vertical text (Y-axis labels, etc.)
+            # Vertical text has aspect_ratio (height/width) > 2 and multiple characters
+            is_vertical_text = (
+                box_height > box_width * 2 and
+                len(content.strip()) > 1
+            )

-            # Create paragraph
-            para = Paragraph(safe_content, elem_style)
+            if is_vertical_text:
+                # For vertical text, use original font size and rotate
+                font_size = min(original_font_size, box_width * 0.9)
+                font_size = max(font_size, 6)  # Minimum 6pt

-            # Calculate available width and height
-            available_width = box_width
-            available_height = box_height * 2  # Allow overflow
+                # Save canvas state for rotation
+                pdf_canvas.saveState()

-            # Wrap the paragraph
-            para_width, para_height = para.wrap(available_width, available_height)
+                # Convert to PDF coordinates
+                pdf_y_center = current_page_height - (y0 + y1) / 2
+                x_center = (x0 + x1) / 2

-            # Convert to PDF coordinates (y from bottom)
-            pdf_y = current_page_height - y0 - para_height
+                # Translate to center, rotate, translate back
+                pdf_canvas.translate(x_center, pdf_y_center)
+                pdf_canvas.rotate(90)

-            # Draw the paragraph
-            para.drawOn(pdf_canvas, x0, pdf_y)
+                # Set font and draw text centered
+                pdf_canvas.setFont(
+                    self.font_name if self.font_registered else 'Helvetica',
+                    font_size
+                )
+                # Draw text at origin (since we translated to center)
+                text_width = pdf_canvas.stringWidth(
+                    safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'),
+                    self.font_name if self.font_registered else 'Helvetica',
+                    font_size
+                )
+                pdf_canvas.drawString(-text_width / 2, -font_size / 3,
+                                      safe_content.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>'))
+
+                pdf_canvas.restoreState()
+            else:
+                # For horizontal text, dynamically fit text within bbox
+                # Start with original font size and reduce until text fits
+                MIN_FONT_SIZE = 6
+                MAX_FONT_SIZE = 14
+
+                if original_font_size > 0:
+                    start_font_size = min(original_font_size, MAX_FONT_SIZE)
+                else:
+                    start_font_size = min(box_height * 0.7, MAX_FONT_SIZE)
+
+                font_size = max(start_font_size, MIN_FONT_SIZE)
+
+                # Try progressively smaller font sizes until text fits
+                para = None
+                para_height = box_height + 1  # Start with height > box to enter loop
+
+                while font_size >= MIN_FONT_SIZE and para_height > box_height:
+                    elem_style = ParagraphStyle(
+                        f'elem_{id(elem)}_{font_size}',
+                        parent=base_style,
+                        fontSize=font_size,
+                        leading=font_size * 1.15,  # Tighter leading
+                    )
+
+                    para = Paragraph(safe_content, elem_style)
+                    para_width, para_height = para.wrap(box_width, box_height * 3)
+
+                    if para_height <= box_height:
+                        break  # Text fits!
+
+                    font_size -= 0.5  # Reduce font size and try again
+
+                # Ensure minimum font size
+                if font_size < MIN_FONT_SIZE:
+                    font_size = MIN_FONT_SIZE
+                    elem_style = ParagraphStyle(
+                        f'elem_{id(elem)}_min',
+                        parent=base_style,
+                        fontSize=font_size,
+                        leading=font_size * 1.15,
+                    )
+                    para = Paragraph(safe_content, elem_style)
+                    para_width, para_height = para.wrap(box_width, box_height * 3)
+
+                # Convert to PDF coordinates (y from bottom)
+                # Clip to bbox height to prevent overflow
+                actual_height = min(para_height, box_height)
+                pdf_y = current_page_height - y0 - actual_height
+
+                # Draw the paragraph
+                para.drawOn(pdf_canvas, x0, pdf_y)

            # Save PDF
            pdf_canvas.save()
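The rotated-text branch converts top-left page coordinates (y grows downward) into PDF user space (y grows upward) and rotates about the bbox centre. The arithmetic can be checked standalone with toy numbers, no ReportLab needed:

```python
page_height = 842.0
# A tall, narrow bbox in top-left coordinates -> treated as vertical text
x0, y0, x1, y1 = 100.0, 200.0, 120.0, 320.0

box_width = x1 - x0
box_height = y1 - y0
# Same heuristic as the diff: height more than twice the width
is_vertical = box_height > box_width * 2

# Centre of the bbox, with y flipped into PDF space (origin bottom-left)
x_center = (x0 + x1) / 2
pdf_y_center = page_height - (y0 + y1) / 2

print(is_vertical, x_center, pdf_y_center)  # True 110.0 582.0
```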
@@ -4451,13 +4641,47 @@ class PDFGeneratorService:
            pdf_y_bottom = page_height - ty1
            pdf_canvas.rect(tx0, pdf_y_bottom, table_width, table_height, stroke=1, fill=0)

-            # Step 2: Draw cell borders using cell_boxes
+            # Step 2: Get or calculate cell boxes
            cell_boxes = metadata.get('cell_boxes', [])
-            if cell_boxes:
-                # Normalize cell boxes for grid alignment
-                if hasattr(self, '_normalize_cell_boxes_to_grid'):
-                    cell_boxes = self._normalize_cell_boxes_to_grid(cell_boxes)
+
+            # If no cell_boxes, calculate from column_widths and row_heights
+            if not cell_boxes:
+                column_widths = metadata.get('column_widths', [])
+                row_heights = metadata.get('row_heights', [])
+
+                if column_widths and row_heights:
+                    # Calculate cell positions from widths and heights
+                    cell_boxes = []
+                    rows = content.get('rows', len(row_heights)) if isinstance(content, dict) else len(row_heights)
+                    cols = content.get('cols', len(column_widths)) if isinstance(content, dict) else len(column_widths)
+
+                    # Calculate cumulative positions
+                    x_positions = [tx0]
+                    for w in column_widths[:cols]:
+                        x_positions.append(x_positions[-1] + w)
+
+                    y_positions = [ty0]
+                    for h in row_heights[:rows]:
+                        y_positions.append(y_positions[-1] + h)
+
+                    # Create cell boxes for each cell (row-major order)
+                    for row_idx in range(rows):
+                        for col_idx in range(cols):
+                            if col_idx < len(x_positions) - 1 and row_idx < len(y_positions) - 1:
+                                cx0 = x_positions[col_idx]
+                                cy0 = y_positions[row_idx]
+                                cx1 = x_positions[col_idx + 1]
+                                cy1 = y_positions[row_idx + 1]
+                                cell_boxes.append([cx0, cy0, cx1, cy1])
+
+                    logger.debug(f"Calculated {len(cell_boxes)} cell boxes from {cols} cols x {rows} rows")
+
+            # Normalize cell boxes for grid alignment
+            if cell_boxes and hasattr(self, '_normalize_cell_boxes_to_grid'):
+                cell_boxes = self._normalize_cell_boxes_to_grid(cell_boxes)
+
+            # Draw cell borders
+            if cell_boxes:
                pdf_canvas.setLineWidth(0.5)
                for box in cell_boxes:
                    if len(box) >= 4:
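The fallback in the hunk above builds cell boxes from column_widths and row_heights via cumulative positions. A standalone toy version of that construction (not the repository's code):

```python
tx0, ty0 = 50.0, 100.0          # table origin (top-left coordinates)
column_widths = [30.0, 70.0]
row_heights = [20.0, 20.0]

# Cumulative x/y edges: N widths give N+1 edges
x_positions = [tx0]
for w in column_widths:
    x_positions.append(x_positions[-1] + w)
y_positions = [ty0]
for h in row_heights:
    y_positions.append(y_positions[-1] + h)

# One box per grid cell, row-major order
cell_boxes = [
    [x_positions[c], y_positions[r], x_positions[c + 1], y_positions[r + 1]]
    for r in range(len(row_heights))
    for c in range(len(column_widths))
]
print(len(cell_boxes))   # 4
print(cell_boxes[0])     # [50.0, 100.0, 80.0, 120.0]
```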
@@ -558,8 +558,8 @@ class PPStructureEnhanced:
                element['embedded_images'] = embedded_images
                logger.info(f"[TABLE] Embedded {len(embedded_images)} images into table")

-            # Special handling for images/figures/stamps (visual elements that need cropping)
-            elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.STAMP, ElementType.LOGO]:
+            # Special handling for images/figures/charts/stamps (visual elements that need cropping)
+            elif mapped_type in [ElementType.IMAGE, ElementType.FIGURE, ElementType.CHART, ElementType.DIAGRAM, ElementType.STAMP, ElementType.LOGO]:
                # Save image if path provided
                if 'img_path' in item and output_dir:
                    saved_path = self._save_image(item['img_path'], output_dir, element['element_id'])

backend/tests/debug_table_cells.py (new file, +43 lines)
@@ -0,0 +1,43 @@
"""Debug PyMuPDF table.cells structure"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

import fitz

pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
doc = fitz.open(str(pdf_path))
page = doc[0]

tables = page.find_tables()
for idx, table in enumerate(tables.tables):
    data = table.extract()
    num_rows = len(data)
    num_cols = max(len(row) for row in data) if data else 0

    print(f"Table {idx}:")
    print(f"  table.extract() dimensions: {num_rows} rows x {num_cols} cols")
    print(f"  Expected positions: {num_rows * num_cols}")

    cell_rects = getattr(table, 'cells', None)
    if cell_rects:
        print(f"  table.cells length: {len(cell_rects)}")
        none_count = sum(1 for c in cell_rects if c is None)
        actual_count = sum(1 for c in cell_rects if c is not None)
        print(f"  None cells: {none_count}")
        print(f"  Actual cells: {actual_count}")

        # Check if cell_rects matches grid size
        if len(cell_rects) != num_rows * num_cols:
            print(f"  WARNING: cell_rects length ({len(cell_rects)}) != grid size ({num_rows * num_cols})")

        # Show first few cells
        print(f"  First 5 cells: {cell_rects[:5]}")
    else:
        print("  table.cells: NOT AVAILABLE")

    # Check row_count and col_count
    print(f"  table.row_count: {getattr(table, 'row_count', 'N/A')}")
    print(f"  table.col_count: {getattr(table, 'col_count', 'N/A')}")

doc.close()
backend/tests/debug_table_cells2.py (new file, 48 lines)
@@ -0,0 +1,48 @@
"""Debug PyMuPDF table structure - find merge info"""
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent))

import fitz

pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
doc = fitz.open(str(pdf_path))
page = doc[0]

tables = page.find_tables()
for idx, table in enumerate(tables.tables):
    print(f"\nTable {idx}:")

    # Check all available attributes
    print(f"  Available attributes: {[a for a in dir(table) if not a.startswith('_')]}")

    # Try to get header info
    if hasattr(table, 'header'):
        print(f"  header: {table.header}")

    # Check for cells info
    cell_rects = table.cells
    print(f"  cells count: {len(cell_rects)}")

    # Get the extracted data
    data = table.extract()
    print(f"  extract() shape: {len(data)} x {max(len(r) for r in data)}")

    # Check if there's a way to map cells to grid positions
    # Look at the pandas output which might have merge info
    try:
        df = table.to_pandas()
        print(f"  pandas shape: {df.shape}")
    except Exception as e:
        print(f"  pandas error: {e}")

    # Check the TableRow objects if available
    if hasattr(table, 'rows'):
        rows = table.rows
        print(f"  rows: {len(rows)}")
        for ri, row in enumerate(rows[:3]):  # first 3 rows
            print(f"    row {ri}: {len(row.cells)} cells")
            for ci, cell in enumerate(row.cells[:5]):  # first 5 cells
                print(f"      cell {ci}: bbox={cell}")

doc.close()
backend/tests/generate_test_pdf.py (new file, 111 lines)
@@ -0,0 +1,111 @@
"""
Generate test PDF to verify Phase 1 fixes
"""

import sys
import os
from pathlib import Path

# Add backend to path
sys.path.insert(0, str(Path(__file__).parent.parent))

from app.services.direct_extraction_engine import DirectExtractionEngine
from app.services.pdf_generator_service import PDFGeneratorService
from app.services.unified_document_exporter import UnifiedDocumentExporter


def generate_test_pdf(input_pdf: str, output_dir: Path):
    """Generate test PDF using Direct Track extraction"""

    input_path = Path(input_pdf)
    output_dir.mkdir(parents=True, exist_ok=True)

    print(f"Processing: {input_path.name}")
    print(f"Output dir: {output_dir}")

    # Step 1: Extract with Direct Track
    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,  # Filter tiny images
        enable_whiteout_detection=True,
        enable_content_sanitization=True
    )

    unified_doc = engine.extract(input_path, output_dir=output_dir)

    # Print extraction stats
    print(f"\n=== Extraction Results ===")
    print(f"Document ID: {unified_doc.document_id}")
    print(f"Pages: {len(unified_doc.pages)}")

    table_count = 0
    image_count = 0
    merged_cells = 0
    total_cells = 0

    for page in unified_doc.pages:
        for elem in page.elements:
            if elem.type.value == 'table':
                table_count += 1
                if elem.content and hasattr(elem.content, 'cells'):
                    total_cells += len(elem.content.cells)
                    for cell in elem.content.cells:
                        if cell.row_span > 1 or cell.col_span > 1:
                            merged_cells += 1
            elif elem.type.value == 'image':
                image_count += 1

    print(f"Tables: {table_count}")
    print(f"  - Total cells: {total_cells}")
    print(f"  - Merged cells: {merged_cells}")
    print(f"Images: {image_count}")

    # Step 2: Export to JSON
    exporter = UnifiedDocumentExporter()
    json_path = output_dir / f"{input_path.stem}_result.json"
    exporter.export_to_json(unified_doc, json_path)
    print(f"\nJSON saved: {json_path}")

    # Step 3: Generate layout PDF
    pdf_generator = PDFGeneratorService()
    pdf_path = output_dir / f"{input_path.stem}_layout.pdf"

    try:
        pdf_generator.generate_from_unified_document(
            unified_doc=unified_doc,
            output_path=pdf_path,
            source_file_path=input_path
        )
        print(f"PDF saved: {pdf_path}")
        return pdf_path
    except Exception as e:
        print(f"PDF generation error: {e}")
        import traceback
        traceback.print_exc()
        return None


if __name__ == "__main__":
    # Test with edit3.pdf (has complex tables with merging)
    demo_docs = Path(__file__).parent.parent.parent / "demo_docs"
    output_base = Path(__file__).parent.parent / "storage" / "test_phase1"

    # Process edit3.pdf
    edit3_pdf = demo_docs / "edit3.pdf"
    if edit3_pdf.exists():
        output_dir = output_base / "edit3"
        result = generate_test_pdf(str(edit3_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")

    # Also process edit.pdf for comparison
    edit_pdf = demo_docs / "edit.pdf"
    if edit_pdf.exists():
        output_dir = output_base / "edit"
        result = generate_test_pdf(str(edit_pdf), output_dir)
        if result:
            print(f"\n✓ Test PDF generated: {result}")

    print(f"\n=== Output Location ===")
    print(f"{output_base}")
backend/tests/test_phase1_fixes.py (new file, 285 lines)
@@ -0,0 +1,285 @@
"""
Phase 1 Bug Fixes Verification Tests

Tests for:
1.1 Direct Track table cell merging
1.2 OCR Track image path preservation
1.3 Cell boxes coordinate validation
1.4 Tiny decoration image filtering
1.5 Covering image removal
"""

import sys
import os
from pathlib import Path

# Add backend to path
sys.path.insert(0, str(Path(__file__).parent.parent))

import fitz
from app.services.direct_extraction_engine import DirectExtractionEngine
from app.services.ocr_to_unified_converter import validate_cell_boxes
from app.models.unified_document import TableCell


def test_1_1_table_cell_merging():
    """Test 1.1.5: Verify edit3.pdf returns correct merged cells"""
    print("\n" + "="*60)
    print("TEST 1.1: Direct Track Table Cell Merging")
    print("="*60)

    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return False

    doc = fitz.open(str(pdf_path))

    total_cells = 0
    merged_cells = 0

    for page_num, page in enumerate(doc):
        tables = page.find_tables()
        for table_idx, table in enumerate(tables.tables):
            data = table.extract()
            cell_rects = getattr(table, 'cells', None)

            if cell_rects:
                num_rows = len(data)
                num_cols = max(len(row) for row in data) if data else 0

                # Count actual cells (non-None)
                actual_cells = sum(1 for c in cell_rects if c is not None)
                none_cells = sum(1 for c in cell_rects if c is None)

                print(f"  Page {page_num}, Table {table_idx}:")
                print(f"    Grid size: {num_rows} x {num_cols} = {num_rows * num_cols} positions")
                print(f"    Actual cells: {actual_cells}")
                print(f"    Merged positions (None): {none_cells}")

                total_cells += actual_cells
                if none_cells > 0:
                    merged_cells += 1

    doc.close()

    print(f"\n  Total actual cells across all tables: {total_cells}")
    print(f"  Tables with merging: {merged_cells}")

    # According to PLAN.md, edit3.pdf should have 83 cells (not 204)
    # The presence of None values indicates merging is detected
    if total_cells > 0 and total_cells < 204:
        print("  RESULT: PASS - Cell merging detected correctly")
        return True
    elif total_cells == 204:
        print("  RESULT: FAIL - All cells treated as 1x1 (no merging detected)")
        return False
    else:
        print(f"  RESULT: INCONCLUSIVE - {total_cells} cells found")
        return None


def test_1_3_cell_boxes_validation():
    """Test 1.3: Verify cell_boxes coordinate validation"""
    print("\n" + "="*60)
    print("TEST 1.3: Cell Boxes Coordinate Validation")
    print("="*60)

    # Test case 1: Valid coordinates
    valid_boxes = [
        [10, 10, 100, 50],
        [100, 10, 200, 50],
        [10, 50, 200, 100]
    ]
    result = validate_cell_boxes(valid_boxes, [0, 0, 300, 200], 300, 200)
    print(f"  Valid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert result['valid'], "Valid boxes should pass validation"

    # Test case 2: Out of bounds coordinates
    invalid_boxes = [
        [-10, 10, 100, 50],  # x0 < 0
        [10, 10, 400, 50],   # x1 > page_width
        [10, 10, 100, 300]   # y1 > page_height
    ]
    result = validate_cell_boxes(invalid_boxes, [0, 0, 300, 200], 300, 200)
    print(f"  Invalid boxes: valid={result['valid']}, invalid_count={result['invalid_count']}")
    assert not result['valid'], "Invalid boxes should fail validation"
    assert result['invalid_count'] == 3, "Should detect 3 invalid boxes"

    # Test case 3: Clamping
    assert len(result['clamped_boxes']) == 3, "Should return clamped boxes"
    clamped = result['clamped_boxes'][0]
    assert clamped[0] >= 0, "Clamped x0 should be >= 0"

    print("  RESULT: PASS - Coordinate validation works correctly")
    return True


def test_1_4_tiny_image_filtering():
    """Test 1.4: Verify tiny decoration image filtering"""
    print("\n" + "="*60)
    print("TEST 1.4: Tiny Decoration Image Filtering")
    print("="*60)

    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None

    doc = fitz.open(str(pdf_path))

    tiny_count = 0
    normal_count = 0
    min_area = 200  # Same threshold as in DirectExtractionEngine

    for page_num, page in enumerate(doc):
        images = page.get_images()
        for img in images:
            xref = img[0]
            rects = page.get_image_rects(xref)
            if rects:
                rect = rects[0]
                area = (rect.x1 - rect.x0) * (rect.y1 - rect.y0)
                if area < min_area:
                    tiny_count += 1
                    print(f"  Page {page_num}: Tiny image xref={xref}, area={area:.1f} px²")
                else:
                    normal_count += 1

    doc.close()

    print(f"\n  Tiny images (< {min_area} px²): {tiny_count}")
    print(f"  Normal images: {normal_count}")

    if tiny_count > 0:
        print("  RESULT: PASS - Tiny images detected, will be filtered")
        return True
    else:
        print("  RESULT: INFO - No tiny images found in test file")
        return None


def test_1_5_covering_image_detection():
    """Test 1.5: Verify covering image detection"""
    print("\n" + "="*60)
    print("TEST 1.5: Covering Image Detection")
    print("="*60)

    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None

    engine = DirectExtractionEngine(
        enable_whiteout_detection=True,
        whiteout_iou_threshold=0.8
    )

    doc = fitz.open(str(pdf_path))

    total_covering = 0
    for page_num, page in enumerate(doc):
        result = engine._preprocess_page(page, page_num, doc)
        covering_images = result.get('covering_images', [])

        if covering_images:
            print(f"  Page {page_num}: {len(covering_images)} covering images detected")
            for img in covering_images[:3]:  # Show first 3
                print(f"    - xref={img.get('xref')}, type={img.get('color_type')}, "
                      f"bbox={[round(x, 1) for x in img.get('bbox', [])]}")
            total_covering += len(covering_images)

    doc.close()

    print(f"\n  Total covering images detected: {total_covering}")

    if total_covering > 0:
        print("  RESULT: PASS - Covering images detected, will be filtered")
        return True
    else:
        print("  RESULT: INFO - No covering images found in test file")
        return None


def test_direct_extraction_full():
    """Full integration test for Direct Track extraction"""
    print("\n" + "="*60)
    print("INTEGRATION TEST: Direct Track Full Extraction")
    print("="*60)

    pdf_path = Path(__file__).parent.parent.parent / "demo_docs" / "edit3.pdf"
    if not pdf_path.exists():
        print(f"SKIP: {pdf_path} not found")
        return None

    engine = DirectExtractionEngine(
        enable_table_detection=True,
        enable_image_extraction=True,
        min_image_area=200.0,
        enable_whiteout_detection=True
    )

    try:
        result = engine.extract(pdf_path)  # Pass Path object, not string

        # Count elements
        table_count = 0
        image_count = 0
        merged_table_count = 0

        for page in result.pages:
            for elem in page.elements:
                if elem.type.value == 'table':
                    table_count += 1
                    if elem.content and hasattr(elem.content, 'cells'):
                        # Check for merged cells
                        for cell in elem.content.cells:
                            if cell.row_span > 1 or cell.col_span > 1:
                                merged_table_count += 1
                                break
                elif elem.type.value == 'image':
                    image_count += 1

        print(f"  Document ID: {result.document_id}")
        print(f"  Pages: {len(result.pages)}")
        print(f"  Tables: {table_count} (with merging: {merged_table_count})")
        print(f"  Images: {image_count}")

        print("  RESULT: PASS - Extraction completed successfully")
        return True

    except Exception as e:
        print(f"  RESULT: FAIL - {e}")
        import traceback
        traceback.print_exc()
        return False


if __name__ == "__main__":
    print("="*60)
    print("Phase 1 Bug Fixes Verification Tests")
    print("="*60)

    results = {}

    # Run tests
    results['1.1_table_merging'] = test_1_1_table_cell_merging()
    results['1.3_coord_validation'] = test_1_3_cell_boxes_validation()
    results['1.4_tiny_filtering'] = test_1_4_tiny_image_filtering()
    results['1.5_covering_detection'] = test_1_5_covering_image_detection()
    results['integration'] = test_direct_extraction_full()

    # Summary
    print("\n" + "="*60)
    print("TEST SUMMARY")
    print("="*60)

    for test_name, result in results.items():
        status = "PASS" if result is True else "FAIL" if result is False else "SKIP/INFO"
        print(f"  {test_name}: {status}")

    passed = sum(1 for r in results.values() if r is True)
    failed = sum(1 for r in results.values() if r is False)
    skipped = sum(1 for r in results.values() if r is None)

    print(f"\n  Total: {passed} passed, {failed} failed, {skipped} skipped/info")
frontend/src/components/ProcessingTrackSelector.tsx (new file, 148 lines)
@@ -0,0 +1,148 @@
import { Card, CardContent, CardHeader, CardTitle } from '@/components/ui/card'
import { Badge } from '@/components/ui/badge'
import { Cpu, FileText, Sparkles, Info } from 'lucide-react'
import type { ProcessingTrack, DocumentAnalysisResponse } from '@/types/apiV2'

interface ProcessingTrackSelectorProps {
  value: ProcessingTrack | null // null means "use system recommendation"
  onChange: (track: ProcessingTrack | null) => void
  documentAnalysis?: DocumentAnalysisResponse | null
  disabled?: boolean
}

export default function ProcessingTrackSelector({
  value,
  onChange,
  documentAnalysis,
  disabled = false,
}: ProcessingTrackSelectorProps) {
  const recommendedTrack = documentAnalysis?.recommended_track

  const tracks = [
    {
      id: null as ProcessingTrack | null,
      name: '自動選擇',
      description: '根據文件類型自動選擇最佳處理方式',
      icon: Sparkles,
      color: 'text-purple-600',
      bgColor: 'bg-purple-50',
      borderColor: 'border-purple-200',
      recommended: false,
    },
    {
      id: 'direct' as ProcessingTrack,
      name: '直接提取 (DIRECT)',
      description: '從 PDF 中直接提取文字圖層,適用於可編輯 PDF',
      icon: FileText,
      color: 'text-blue-600',
      bgColor: 'bg-blue-50',
      borderColor: 'border-blue-200',
      recommended: recommendedTrack === 'direct',
    },
    {
      id: 'ocr' as ProcessingTrack,
      name: 'OCR 識別',
      description: '使用光學字元識別處理圖片或掃描文件',
      icon: Cpu,
      color: 'text-green-600',
      bgColor: 'bg-green-50',
      borderColor: 'border-green-200',
      recommended: recommendedTrack === 'ocr',
    },
  ]

  return (
    <Card>
      <CardHeader>
        <div className="flex items-center gap-3">
          <div className="p-2 bg-primary/10 rounded-lg">
            <Sparkles className="w-5 h-5 text-primary" />
          </div>
          <div>
            <CardTitle>處理方式選擇</CardTitle>
            <p className="text-sm text-muted-foreground mt-1">
              選擇文件的處理方式,或讓系統自動判斷
            </p>
          </div>
        </div>
      </CardHeader>
      <CardContent className="space-y-3">
        {/* Info about override */}
        {value !== null && recommendedTrack && value !== recommendedTrack && (
          <div className="flex items-start gap-2 p-3 bg-amber-50 border border-amber-200 rounded-lg">
            <Info className="w-4 h-4 text-amber-600 flex-shrink-0 mt-0.5" />
            <p className="text-sm text-amber-800">
              您已覆蓋系統建議。系統原本建議使用「{recommendedTrack === 'direct' ? '直接提取' : 'OCR 識別'}」方式處理此文件。
            </p>
          </div>
        )}

        {/* Track options */}
        <div className="grid gap-3">
          {tracks.map((track) => {
            const isSelected = value === track.id
            const Icon = track.icon

            return (
              <button
                key={track.id ?? 'auto'}
                type="button"
                disabled={disabled}
                onClick={() => onChange(track.id)}
                className={`
                  w-full p-4 rounded-lg border-2 text-left transition-all
                  ${isSelected
                    ? `${track.borderColor} ${track.bgColor}`
                    : 'border-border hover:border-primary/30 hover:bg-muted/30'
                  }
                  ${disabled ? 'opacity-50 cursor-not-allowed' : 'cursor-pointer'}
                `}
              >
                <div className="flex items-start gap-3">
                  <div className={`p-2 rounded-lg ${isSelected ? track.bgColor : 'bg-muted'}`}>
                    <Icon className={`w-5 h-5 ${isSelected ? track.color : 'text-muted-foreground'}`} />
                  </div>
                  <div className="flex-1 min-w-0">
                    <div className="flex items-center gap-2">
                      <span className={`font-medium ${isSelected ? track.color : ''}`}>
                        {track.name}
                      </span>
                      {track.recommended && (
                        <Badge variant="outline" className="text-xs bg-white">
                          系統建議
                        </Badge>
                      )}
                      {isSelected && (
                        <Badge variant="default" className="text-xs">
                          已選擇
                        </Badge>
                      )}
                    </div>
                    <p className="text-sm text-muted-foreground mt-1">
                      {track.description}
                    </p>
                  </div>
                </div>
              </button>
            )
          })}
        </div>

        {/* Current analysis info */}
        {documentAnalysis && (
          <div className="pt-3 border-t border-border">
            <div className="flex flex-wrap gap-x-4 gap-y-1 text-xs text-muted-foreground">
              <span>文件分析信心度: {(documentAnalysis.confidence * 100).toFixed(0)}%</span>
              {documentAnalysis.page_count && (
                <span>頁數: {documentAnalysis.page_count}</span>
              )}
              {documentAnalysis.text_coverage !== null && (
                <span>文字覆蓋率: {(documentAnalysis.text_coverage * 100).toFixed(1)}%</span>
              )}
            </div>
          </div>
        )}
      </CardContent>
    </Card>
  )
}
@@ -8,14 +8,15 @@ import { Button } from '@/components/ui/button'
 import { Badge } from '@/components/ui/badge'
 import { useToast } from '@/components/ui/toast'
 import { apiClientV2 } from '@/services/apiV2'
-import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2, Info } from 'lucide-react'
+import { Play, CheckCircle, FileText, AlertCircle, Clock, Activity, Loader2 } from 'lucide-react'
 import LayoutModelSelector from '@/components/LayoutModelSelector'
 import PreprocessingSettings from '@/components/PreprocessingSettings'
 import PreprocessingPreview from '@/components/PreprocessingPreview'
 import TableDetectionSelector from '@/components/TableDetectionSelector'
+import ProcessingTrackSelector from '@/components/ProcessingTrackSelector'
 import TaskNotFound from '@/components/TaskNotFound'
 import { useTaskValidation } from '@/hooks/useTaskValidation'
-import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, DocumentAnalysisResponse } from '@/types/apiV2'
+import type { LayoutModel, ProcessingOptions, PreprocessingMode, PreprocessingConfig, TableDetectionConfig, ProcessingTrack } from '@/types/apiV2'

 export default function ProcessingPage() {
   const { t } = useTranslation()
@@ -56,6 +57,9 @@ export default function ProcessingPage() {
     enable_region_detection: true,
   })

+  // Processing track override state (null = use system recommendation)
+  const [forceTrack, setForceTrack] = useState<ProcessingTrack | null>(null)
+
   // Analyze document to determine if OCR is needed (only for pending tasks)
   const { data: documentAnalysis, isLoading: isAnalyzing } = useQuery({
     queryKey: ['documentAnalysis', taskId],
@@ -65,16 +69,23 @@ export default function ProcessingPage() {
   })

   // Determine if preprocessing options should be shown
-  // Only show for OCR track files (images and non-editable PDFs)
-  const needsOcrTrack = documentAnalysis?.recommended_track === 'ocr' ||
-    documentAnalysis?.recommended_track === 'hybrid' ||
-    !documentAnalysis // Show by default while analyzing
+  // Show OCR options when:
+  // 1. User explicitly selected OCR track
+  // 2. OR system recommends OCR/hybrid track (and user hasn't overridden to direct)
+  // 3. OR still analyzing (show by default)
+  const needsOcrTrack = forceTrack === 'ocr' ||
+    (forceTrack === null && (
+      documentAnalysis?.recommended_track === 'ocr' ||
+      documentAnalysis?.recommended_track === 'hybrid' ||
+      !documentAnalysis
+    ))

   // Start OCR processing
   const processOCRMutation = useMutation({
     mutationFn: () => {
       const options: ProcessingOptions = {
-        use_dual_track: true,
+        use_dual_track: forceTrack === null, // Only use dual-track auto-detection if not forcing
+        force_track: forceTrack || undefined, // Pass force_track if user selected one
         language: 'ch',
         layout_model: layoutModel,
         preprocessing_mode: preprocessingMode,
@@ -392,53 +403,14 @@ export default function ProcessingPage() {
             </div>
           )}

-          {/* Document Analysis Info */}
-          {documentAnalysis && (
-            <Card className={documentAnalysis.recommended_track === 'direct' ? 'border-blue-200 bg-blue-50' : 'border-green-200 bg-green-50'}>
-              <CardContent className="pt-4">
-                <div className="flex items-start gap-3">
-                  <Info className={`w-5 h-5 flex-shrink-0 mt-0.5 ${documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}`} />
-                  <div className="flex-1">
-                    {documentAnalysis.recommended_track === 'direct' ? (
-                      <>
-                        <p className="text-sm font-medium text-blue-800">此文件為可編輯 PDF</p>
-                        <p className="text-sm text-blue-700 mt-1">
-                          系統偵測到此 PDF 包含文字圖層,將使用直接文字提取方式處理。
-                          版面偵測和影像前處理設定不適用於此類文件。
-                        </p>
-                      </>
-                    ) : (
-                      <>
-                        <p className="text-sm font-medium text-green-800">
-                          {documentAnalysis.is_editable ? '混合文件' : '掃描文件 / 影像'}
-                        </p>
-                        <p className="text-sm text-green-700 mt-1">
-                          {documentAnalysis.reason}
-                        </p>
-                      </>
-                    )}
-                    <div className="flex flex-wrap gap-4 mt-2 text-xs">
-                      <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-                        處理方式: {documentAnalysis.recommended_track === 'direct' ? '直接提取' : documentAnalysis.recommended_track === 'ocr' ? 'OCR 識別' : '混合處理'}
-                      </span>
-                      {documentAnalysis.page_count && (
-                        <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-                          頁數: {documentAnalysis.page_count}
-                        </span>
-                      )}
-                      {documentAnalysis.text_coverage !== null && (
-                        <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-                          文字覆蓋率: {(documentAnalysis.text_coverage * 100).toFixed(1)}%
-                        </span>
-                      )}
-                      <span className={documentAnalysis.recommended_track === 'direct' ? 'text-blue-600' : 'text-green-600'}>
-                        信心度: {(documentAnalysis.confidence * 100).toFixed(0)}%
-                      </span>
-                    </div>
-                  </div>
-                </div>
-              </CardContent>
-            </Card>
-          )}
+          {/* Processing Track Selector - Always show after analysis */}
+          {!isAnalyzing && (
+            <ProcessingTrackSelector
+              value={forceTrack}
+              onChange={setForceTrack}
+              documentAnalysis={documentAnalysis}
+              disabled={processOCRMutation.isPending}
+            />
+          )}

           {/* OCR Track Options - Only show when document needs OCR */}
openspec/changes/refactor-dual-track-architecture/design.md (new file, 240 lines)
@@ -0,0 +1,240 @@
# Design: Refactor Dual-Track Architecture

## Context

Tool_OCR is a dual-track document processing system that supports:
- **Direct Track**: extracts structured content directly from editable PDFs
- **OCR Track**: performs optical character recognition with PaddleOCR + PP-StructureV3

The current system carries the following technical debt:
- OCRService (2,326 lines) has too many responsibilities
- PDFGeneratorService (4,644 lines) is a monolithic service
- Memory management is scattered across multiple components
- Known bugs degrade output quality

## Goals / Non-Goals

### Goals
- Fix all known bugs listed in PLAN.md
- Split OCRService into maintainable units of < 800 lines
- Bring PDFGeneratorService down to < 2,000 lines
- Simplify memory-management configuration
- Improve frontend state-management consistency

### Non-Goals
- Do not change existing API contracts
- Do not introduce new external dependencies
- Do not change the database schema
- Do not change the user interface

## Decisions

### Decision 1: Replace custom table detection with PyMuPDF find_tables()

**Choice**: Use PyMuPDF's built-in `page.find_tables()` API

**Rationale**:
- PyMuPDF's table detection correctly identifies merged cells
- The returned `table.cells` structure carries span information
- It reduces the maintenance burden of custom code

**Alternatives**:
- Improve the `_detect_tables_by_position()` algorithm
  - Pros: no dependency on external API changes
  - Cons: high complexity; hard to cover every edge case
- Use Camelot or Tabula
  - Pros: mature table-extraction libraries
  - Cons: new dependencies and added system complexity
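To illustrate why `table.cells` is enough to recover merges: grid positions swallowed by a merged cell appear as `None`, and each surviving rectangle spans one or more grid boundaries. The following is a minimal span-recovery sketch, assuming exact coordinate matches (real PDFs would need a small tolerance); `compute_cell_spans` is a hypothetical helper name, not an existing PyMuPDF API:

```python
def compute_cell_spans(cell_rects):
    """Derive (row, col, row_span, col_span) from PyMuPDF-style cell rects.

    cell_rects: row-major list of (x0, y0, x1, y1) tuples; positions covered
    by a merged cell are None. Grid lines are recovered from the distinct
    x/y edges of the surviving rectangles (exact coordinate matches assumed).
    """
    rects = [r for r in cell_rects if r is not None]
    xs = sorted({r[0] for r in rects} | {r[2] for r in rects})  # column edges
    ys = sorted({r[1] for r in rects} | {r[3] for r in rects})  # row edges

    cells = []
    for x0, y0, x1, y1 in rects:
        row, col = ys.index(y0), xs.index(x0)
        cells.append({
            'row': row,
            'col': col,
            'row_span': ys.index(y1) - row,  # number of grid rows covered
            'col_span': xs.index(x1) - col,  # number of grid columns covered
        })
    return cells
```

For example, a 2x2 grid whose top row is one merged cell yields `col_span=2` for that cell and 1x1 spans for the remaining two.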
### Decision 2: Refactor the service layer with the Strategy Pattern

**Choice**: Introduce a ProcessingOrchestrator built on the strategy pattern

```python
class ProcessingPipeline(Protocol):
    def process(self, file_path: str, options: ProcessingOptions) -> UnifiedDocument:
        ...

class DirectPipeline(ProcessingPipeline):
    def __init__(self, extraction_engine: DirectExtractionEngine):
        self.engine = extraction_engine

    def process(self, file_path, options):
        return self.engine.extract(file_path)

class OCRPipeline(ProcessingPipeline):
    def __init__(self, ocr_service: OCRService, preprocessor: LayoutPreprocessingService):
        self.ocr = ocr_service
        self.preprocessor = preprocessor

    def process(self, file_path, options):
        # Preprocessing + OCR + Conversion
        ...

class ProcessingOrchestrator:
    def __init__(self, detector: DocumentTypeDetector, pipelines: dict[str, ProcessingPipeline]):
        self.detector = detector
        self.pipelines = pipelines

    def process(self, file_path, options):
        track = options.force_track or self.detector.detect(file_path).track
        return self.pipelines[track].process(file_path, options)
```

**Rationale**:
- Separation of concerns: detection, processing, and conversion stay independent
- Testability: each pipeline can be tested in isolation
- Extensibility: adding a processing mode only requires a new pipeline

**Alternatives**:
- Chain of Responsibility
  - Pros: more flexible processing chains
  - Cons: overly complex for an either-or scenario
- Keep the status quo and only tidy the code
  - Pros: lowest risk
  - Cons: does not address the root problem

### Decision 3: Extract PDF generation logic into layers

**Choice**: Split PDFGeneratorService into three modules

```
PDFGeneratorService (main orchestration)
├── PDFTableRenderer (table rendering)
│   ├── HTMLTableParser (HTML table parsing)
│   └── CellRenderer (cell rendering)
├── PDFFontManager (font management)
│   ├── FontLoader (font loading)
│   └── FontFallback (font fallback)
└── PDFLayoutEngine (layout)
```

**Rationale**:
- Single responsibility: each module focuses on one thing
- Reusability: FontManager can serve other services
- Testability: table rendering can be tested on its own
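The module tree above can be sketched as a composition skeleton. The class names follow the tree, but the method signatures and the CJK fallback rule are illustrative assumptions, not the real implementation:

```python
class PDFFontManager:
    """Resolves a font for a text run, with a fallback for missing glyphs."""

    def resolve(self, text: str, preferred: str) -> str:
        # Hypothetical fallback rule: CJK text goes to a CJK-capable font.
        if any('\u4e00' <= ch <= '\u9fff' for ch in text):
            return "NotoSansCJK"
        return preferred

class PDFTableRenderer:
    """Renders table cells; delegates font choice to the font manager."""

    def __init__(self, fonts: PDFFontManager):
        self.fonts = fonts

    def render_cell(self, text: str) -> dict:
        return {"text": text, "font": self.fonts.resolve(text, "Helvetica")}

class PDFGeneratorService:
    """Main orchestration: wires the renderer and font manager together."""

    def __init__(self):
        self.fonts = PDFFontManager()
        self.tables = PDFTableRenderer(self.fonts)
```

The point of the split is visible even in this toy: the table renderer never knows how fonts are chosen, so font logic can change (or be reused elsewhere) without touching rendering code.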
### Decision 4: Unify the memory policy engine

**Choice**: Merge the memory-management components into a single MemoryPolicyEngine

```python
class MemoryPolicyEngine:
    """Unified memory policy engine"""

    def __init__(self, config: MemoryConfig):
        self.config = config
        self._semaphore = asyncio.Semaphore(config.max_concurrent_predictions)

    @property
    def gpu_usage_percent(self) -> float:
        # Unified GPU usage query
        ...

    def check_availability(self) -> MemoryStatus:
        # Returns AVAILABLE, WARNING, CRITICAL, EMERGENCY
        ...

    async def acquire_prediction_slot(self):
        # Unified concurrency control
        ...

    def cleanup_if_needed(self):
        # Automatic cleanup based on status
        ...

@dataclass
class MemoryConfig:
    warning_threshold: float = 0.80   # 80%
    critical_threshold: float = 0.95  # 95%
    max_concurrent_predictions: int = 2
    model_idle_timeout: int = 300     # 5 minutes
```

**Rationale**:
- Fewer configuration knobs: from 8+ down to 4 core settings
- Simpler dependencies: each service depends on a single memory engine
- Consistent behavior: all memory decisions are made in one place
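The threshold-to-status mapping inside `check_availability()` might look like the sketch below; the emergency cutoff is an assumption, since `MemoryConfig` only defines warning and critical thresholds:

```python
from enum import Enum

class MemoryStatus(Enum):
    AVAILABLE = "available"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"

def classify_memory_status(usage: float,
                           warning_threshold: float = 0.80,
                           critical_threshold: float = 0.95,
                           emergency_threshold: float = 0.99) -> MemoryStatus:
    """Map a GPU usage ratio (0.0-1.0) onto the four states check_availability() returns.

    The emergency_threshold value is hypothetical; only the warning and
    critical thresholds come from MemoryConfig in the design above.
    """
    if usage >= emergency_threshold:
        return MemoryStatus.EMERGENCY
    if usage >= critical_threshold:
        return MemoryStatus.CRITICAL
    if usage >= warning_threshold:
        return MemoryStatus.WARNING
    return MemoryStatus.AVAILABLE
```

Centralizing this mapping is what makes "all memory decisions in one place" concrete: callers only branch on the returned status, never on raw percentages.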
### Decision 5: Manage Task State with Zustand

**Choice**: Add a TaskStore that manages task state in one place

```typescript
interface TaskState {
  currentTaskId: string | null;
  tasks: Record<string, TaskDetail>;
  processingStatus: Record<string, ProcessingStatus>;
}

interface TaskActions {
  setCurrentTask: (taskId: string) => void;
  updateTask: (taskId: string, updates: Partial<TaskDetail>) => void;
  updateProcessingStatus: (taskId: string, status: ProcessingStatus) => void;
  clearTasks: () => void;
}

const useTaskStore = create<TaskState & TaskActions>()(
  persist(
    (set) => ({
      currentTaskId: null,
      tasks: {},
      processingStatus: {},
      // ... actions
    }),
    { name: 'task-storage' }
  )
);
```

**Rationale**:
- Consistency: matches the existing uploadStore and authStore patterns
- Traceability: task state changes are managed centrally
- Persistence: state survives a page refresh

## Risks / Trade-offs

| Risk | Impact | Mitigation |
|------|--------|------------|
| PyMuPDF find_tables() API changes | Medium | Wrap it in a standalone function that is easy to replace |
| Service refactoring breaks processing logic | High | Keep the existing tests and refactor step by step |
| Memory engine changes cause OOM | High | Keep the same thresholds; change only the code structure |
| Frontend state migration introduces bugs | Medium | Migrate page by page and fully test each page |

## Migration Plan

### Step 1: Bug Fixes (independently deployable)
1. Implement the PyMuPDF find_tables() integration
2. Fix the OCR Track image paths
3. Add cell_boxes coordinate validation
4. Test and deploy

### Step 2: Service Refactoring (independently deployable)
1. Extract ProcessingOrchestrator
2. Extract TableRenderer and FontManager
3. Update OCRService to use the new components
4. Test and deploy

### Step 3: Memory Management (independently deployable)
1. Implement MemoryPolicyEngine
2. Migrate services to the new engine step by step
3. Remove the old components
4. Test and deploy

### Step 4: Frontend Improvements (independently deployable)
1. Add TaskStore
2. Migrate ProcessingPage
3. Migrate TaskDetailPage
4. Consolidate type definitions
5. Test and deploy

### Rollback Plan
- Each step is deployed independently; on failure, roll back to the previous stable version
- Bug fixes come first to ensure basic functionality is correct
- The refactoring does not change external behavior, so rollback impact is minimal

## Open Questions

1. **PyMuPDF find_tables() version compatibility**: confirm that the PyMuPDF version currently in use supports this API
2. **Scope of frontend state persistence**: should all tasks be persisted, or only the current session?
3. **Memory threshold tuning**: have the existing thresholds been validated in production so they can be reused as-is?

# Change: Refactor Dual-Track Architecture

## Why

The current dual-track OCR system has several known problems and architectural debts:

1. **Direct Track table bug**: `_detect_tables_by_position()` cannot detect merged cells, so edit3.pdf yields 204 incorrectly split cells (should be 83)
2. **OCR Track image paths lost**: `saved_path` for visual elements such as CHART/DIAGRAM is dropped during conversion, so images are never placed back into the PDF
3. **OCR Track cell_boxes out of bounds**: the cell_boxes returned by PP-StructureV3 exceed the page boundaries
4. **Overly complex service layer**: OCRService (2,326 lines) carries too many responsibilities and is hard to maintain and test
5. **Oversized PDF generator**: PDFGeneratorService (4,644 lines) is a monolithic service that is hard to extend

## What Changes

### Phase 1: Fix Known Bugs (priority: highest)

- **Direct Track table fix**: replace `_detect_tables_by_position()` with the PyMuPDF `find_tables()` API
- **OCR Track image path fix**: extend `_convert_pp3_element` to handle all visual element types (IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP)
- **Cell box coordinate validation**: add boundary checks, falling back to CV line detection when coordinates are out of range
- **Filter tiny decoration images**: drop images smaller than 200 px²
- **Remove covering images**: filter images that overlap covering_images at the rendering stage

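The merged-cell recovery behind the Direct Track table fix can be sketched geometrically. This is a minimal illustration, assuming each visual cell (merged cells included) is reported once as an axis-aligned bbox whose edges all lie on shared grid lines; `compute_spans` and the sample grid are hypothetical, not the project's actual code.

```python
def compute_spans(cells):
    """Recover (row_span, col_span) for each anchor cell bbox.

    cells: list of (x0, y0, x1, y1) rectangles, one per visual cell,
    where a merged cell appears once and covers its full extent.
    """
    # Collect the grid lines implied by all cell edges.
    xs = sorted({x for x0, _, x1, _ in cells for x in (x0, x1)})
    ys = sorted({y for _, y0, _, y1 in cells for y in (y0, y1)})
    spans = []
    for x0, y0, x1, y1 in cells:
        # A cell spanning k column intervals has col_span == k, and
        # likewise for rows.
        col_span = xs.index(x1) - xs.index(x0)
        row_span = ys.index(y1) - ys.index(y0)
        spans.append((row_span, col_span))
    return spans
```

For a 2x3 grid whose top-left cell spans two columns, the top-left anchor gets `(1, 2)` while every other cell gets `(1, 1)`, which is the "count visual cells, not grid slots" behavior the fix targets.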
### Phase 2: Service Layer Refactoring (priority: high)

- **Split OCRService**: extract a standalone `ProcessingOrchestrator` responsible for flow orchestration
- **Introduce a Pipeline pattern**: use composition instead of the current aggregation
- **Extract TableRenderer**: pull the table-rendering logic out of PDFGeneratorService
- **Extract FontManager**: pull the font-management logic out of PDFGeneratorService

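A sketch of the Pipeline pattern named above. The `process(file_path, options)` dispatch mirrors the design doc; the interface shape and the placeholder pipeline bodies are assumptions for illustration.

```python
from abc import ABC, abstractmethod

class ProcessingPipeline(ABC):
    """Interface implemented by DirectPipeline and OCRPipeline (sketch)."""

    @abstractmethod
    def process(self, file_path: str, options: dict):
        ...

class DirectPipeline(ProcessingPipeline):
    def process(self, file_path, options):
        return f"direct:{file_path}"  # placeholder for PyMuPDF extraction

class OCRPipeline(ProcessingPipeline):
    def process(self, file_path, options):
        return f"ocr:{file_path}"  # placeholder for PP-StructureV3 processing

class ProcessingOrchestrator:
    def __init__(self, pipelines: dict):
        self.pipelines = pipelines  # e.g. {"direct": ..., "ocr": ...}

    def process(self, track: str, file_path: str, options: dict):
        # Track detection picks the key; the orchestrator only dispatches.
        return self.pipelines[track].process(file_path, options)
```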
### Phase 3: Memory Management Simplification (priority: medium)

- **Unify memory policy**: merge MemoryManager, MemoryGuard, and the assorted semaphores into a single policy engine
- **Simplify configuration**: reduce 8+ memory-related settings to 3-4 core items

### Phase 4: Frontend State Management (priority: medium)

- **Add TaskStore**: manage task state with Zustand, replacing scattered useState calls
- **Consolidate type definitions**: merge api.ts and apiV2.ts into a single type definition file

## Impact

- Affected specs: `document-processing`
- Affected code:
  - `backend/app/services/direct_extraction_engine.py` (table detection)
  - `backend/app/services/ocr_to_unified_converter.py` (element conversion)
  - `backend/app/services/ocr_service.py` (service orchestration)
  - `backend/app/services/pdf_generator_service.py` (PDF generation)
  - `backend/app/services/memory_manager.py` (memory management)
  - `frontend/src/store/` (state management)
  - `frontend/src/types/` (type definitions)

## Risk Assessment

| Risk | Severity | Mitigation |
|------|----------|------------|
| Table rendering regressions | High | Use edit.pdf and edit3.pdf as regression tests |
| Memory-management changes cause OOM | High | Keep the existing thresholds; refactor the code structure only |
| Service refactoring breaks processing | Medium | Refactor step by step with full testing at each stage |

## Success Metrics

| Metric | Current | Target |
|--------|---------|--------|
| edit3.pdf Direct Track cells | 204 (wrong) | 83 (correct) |
| OCR Track image re-embedding rate | 0% | 100% |
| cell_boxes coordinate correctness | ~40% | 100% |
| OCRService line count | 2,326 | < 800 |
| PDFGeneratorService line count | 4,644 | < 2,000 |

# document-processing Specification Delta

## ADDED Requirements

### Requirement: Table Cell Merging Detection
The system SHALL correctly detect and preserve merged cells (rowspan/colspan) when extracting tables from PDF documents.

#### Scenario: Detect merged cells in Direct Track
- **WHEN** extracting tables from an editable PDF using Direct Track
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** correctly identify cells with rowspan > 1 or colspan > 1
- **AND** preserve merge information in the UnifiedDocument table structure
- **AND** skip placeholder cells that are covered by merged cells

#### Scenario: Handle complex table structures
- **WHEN** processing a table with mixed merged and regular cells (e.g., edit3.pdf with 83 cells including 121 merges)
- **THEN** the system SHALL NOT split merged cells into individual cells
- **AND** the output cell count SHALL match the actual visual cell count
- **AND** the rendered PDF SHALL display correct merged cell boundaries

### Requirement: Visual Element Path Preservation
The system SHALL preserve image paths for all visual element types during OCR conversion.

#### Scenario: Preserve CHART element paths
- **WHEN** converting PP-StructureV3 output containing CHART elements
- **THEN** the system SHALL treat CHART as a visual element type
- **AND** extract saved_path from the element data
- **AND** include saved_path in the UnifiedDocument content field

#### Scenario: Support all visual element types
- **WHEN** processing visual elements of types IMAGE, FIGURE, CHART, DIAGRAM, LOGO, or STAMP
- **THEN** the system SHALL extract saved_path or img_path for each element
- **AND** preserve path, width, height, and format in the content dictionary
- **AND** enable downstream PDF generation to embed these images

#### Scenario: Fallback path resolution
- **WHEN** a visual element has multiple path fields (saved_path, img_path)
- **THEN** the system SHALL prefer saved_path over img_path
- **AND** fall back to img_path if saved_path is missing
- **AND** log a warning if both paths are missing

### Requirement: Cell Box Coordinate Validation
The system SHALL validate cell box coordinates from PP-StructureV3 and handle out-of-bounds cases.

#### Scenario: Detect out-of-bounds coordinates
- **WHEN** processing cell_boxes from PP-StructureV3
- **THEN** the system SHALL validate each coordinate against page boundaries (0, 0, page_width, page_height)
- **AND** log tables with coordinates exceeding page bounds
- **AND** mark affected cells for fallback processing

#### Scenario: Apply CV line detection fallback
- **WHEN** cell_boxes coordinates are invalid (out of bounds)
- **THEN** the system SHALL apply OpenCV line detection as a fallback
- **AND** reconstruct the table structure from detected lines
- **AND** include a fallback_used flag in table metadata

#### Scenario: Coordinate normalization
- **WHEN** coordinates are within page bounds but slightly outside the table bbox
- **THEN** the system SHALL clamp coordinates to table boundaries
- **AND** preserve relative cell positions
- **AND** ensure no cells overlap after normalization

### Requirement: Decoration Image Filtering
The system SHALL filter out minimal decoration images that do not contribute meaningful content.

#### Scenario: Filter tiny images by area
- **WHEN** extracting images from a document
- **THEN** the system SHALL calculate image area (width x height)
- **AND** filter out images with area < 200 square pixels
- **AND** log the filtered image count for debugging

#### Scenario: Configurable filtering threshold
- **WHEN** processing documents with intentionally small images
- **THEN** the system SHALL support configuration of the minimum image area threshold
- **AND** default to 200 square pixels if not specified
- **AND** allow threshold = 0 to disable filtering

### Requirement: Covering Image Removal
The system SHALL remove covering/redaction images from the final output.

#### Scenario: Detect covering rectangles
- **WHEN** preprocessing a PDF page
- **THEN** the system SHALL detect black/white rectangles covering text regions
- **AND** identify covering images by high IoU (> 0.8) with underlying content
- **AND** mark covering images for exclusion

#### Scenario: Exclude covering images from rendering
- **WHEN** generating the output PDF
- **THEN** the system SHALL exclude images marked as covering
- **AND** preserve the text content that was covered
- **AND** include a covering_images_removed count in metadata

#### Scenario: Handle both black and white covering
- **WHEN** detecting covering rectangles
- **THEN** the system SHALL detect black fill (redaction style)
- **AND** white fill (whiteout style)
- **AND** low-contrast rectangles intended to hide content

## MODIFIED Requirements

### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through the OCR track
- **THEN** the system SHALL use page_result.json['parsing_res_list']
- **AND** extract all element types including headers, lists, tables, figures
- **AND** preserve layout_bbox coordinates for each element

#### Scenario: Maintain reading order
- **WHEN** extracting elements from PP-StructureV3
- **THEN** the system SHALL preserve the reading order from parsing_res_list
- **AND** assign sequential indices to elements
- **AND** support reordering for complex layouts

#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve the table HTML for structure
- **AND** extract plain text for translation

#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in the output PDF

### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use the PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges

#### Scenario: Filter decoration images
- **WHEN** extracting images from a PDF
- **THEN** the system SHALL filter images smaller than the minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images

#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output

# Tasks: Refactor Dual-Track Architecture

## Phase 1: Fix Known Bugs (completed)

### 1.1 Direct Track Table Fix (completed ✓)
- [x] 1.1.1 Update the `_process_native_table()` method to use `table.cells` for merged cells
- [x] 1.1.2 Use the PyMuPDF `page.find_tables()` API (already in use)
- [x] 1.1.3 Parse `table.cells` and compute `row_span`/`col_span` correctly
- [x] 1.1.4 Handle merged-away cells (skip `None` values, build a covered grid)
- [x] 1.1.5 Verify that edit3.pdf returns the correct 83 cells ✓

### 1.2 OCR Track Image Path Fix (completed ✓)
- [x] 1.2.1 Update `ocr_to_unified_converter.py` lines 604-613
- [x] 1.2.2 Extend the visual-element type check: `IMAGE, FIGURE, CHART, DIAGRAM, LOGO, STAMP`
- [x] 1.2.3 Prefer `saved_path`, falling back to `img_path`
- [x] 1.2.4 Ensure the content dict contains `saved_path`, `path`, `width`, `height`, `format`
- [x] 1.2.5 Code fixed (needs verification with a full OCR Track test)
- [x] 1.2.6 Code fixed (needs verification with a full OCR Track test)

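The path-resolution rule in 1.2.3-1.2.4 can be sketched as follows; the function name, dict keys beyond those listed in 1.2.4, and the lowercase type names are illustrative assumptions, not the converter's actual code.

```python
# Types treated as visual elements after the 1.2.2 fix (lowercased here
# for illustration).
VISUAL_TYPES = {"image", "figure", "chart", "diagram", "logo", "stamp"}

def resolve_visual_content(elem: dict) -> dict:
    """Build the content dict for a visual element.

    Prefers saved_path over img_path, per the 1.2.3 rule; an empty
    path signals that both fields were missing and a warning should
    be logged.
    """
    path = elem.get("saved_path") or elem.get("img_path") or ""
    return {
        "path": path,
        "saved_path": elem.get("saved_path", ""),
        "width": elem.get("width"),
        "height": elem.get("height"),
        "format": elem.get("format", "png"),
    }
```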
### 1.3 Cell Box Coordinate Validation (completed ✓)
- [x] 1.3.1 Add a `validate_cell_boxes()` function to `ocr_to_unified_converter.py`
- [x] 1.3.2 Check whether cell_boxes exceed the page bounds (0, 0, page_width, page_height)
- [x] 1.3.3 Use clamped coordinates when out of range and mark needs_fallback
- [x] 1.3.4 Log anomalous coordinates
- [x] 1.3.5 Unit-test that the coordinate validation logic is correct ✓

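The clamp-and-flag behavior in 1.3.2-1.3.3 can be sketched as below. This is a simplified illustration; the actual function's signature and return shape in `ocr_to_unified_converter.py` may differ.

```python
def validate_cell_boxes(cell_boxes, page_width, page_height):
    """Clamp out-of-range boxes to the page and flag the table.

    Returns (clamped_boxes, needs_fallback): any out-of-bounds box
    triggers the CV line-detection fallback per the spec.
    """
    clamped, needs_fallback = [], False
    for x0, y0, x1, y1 in cell_boxes:
        if x0 < 0 or y0 < 0 or x1 > page_width or y1 > page_height:
            needs_fallback = True  # out of bounds: mark for fallback
        clamped.append((max(0, min(x0, page_width)),
                        max(0, min(y0, page_height)),
                        max(0, min(x1, page_width)),
                        max(0, min(y1, page_height))))
    return clamped, needs_fallback
```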
### 1.4 Filter Tiny Decoration Images (completed ✓)
- [x] 1.4.1 Add an area check to the image-extraction logic in `direct_extraction_engine.py`
- [x] 1.4.2 Filter images where `image_area < min_image_area` (default 200 px²)
- [x] 1.4.3 Add a `min_image_area` setting so the threshold can be tuned
- [x] 1.4.4 Verify that edit3.pdf detects the 3 tiny decoration images ✓

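The 1.4.2 area filter reduces to a one-line predicate; the function name is illustrative, but the threshold semantics (default 200 px², 0 disables filtering) follow the spec.

```python
def keep_image(width: float, height: float, min_image_area: float = 200) -> bool:
    """Area filter from 1.4.2; min_image_area=0 disables filtering."""
    return min_image_area == 0 or width * height >= min_image_area
```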
### 1.5 Remove Covering Images (completed ✓)
- [x] 1.5.1 Pass `covering_images` into the `_extract_images()` method
- [x] 1.5.2 Identify covering images via an IoU threshold (0.8) and xref matching
- [x] 1.5.3 Exclude covering images from the final output
- [x] 1.5.4 Add a `_calculate_iou()` helper method
- [x] 1.5.5 Verify that edit3.pdf detects the 6 black-box covering images ✓

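The IoU check behind the 1.5.2 threshold is standard rectangle intersection-over-union; this standalone sketch mirrors what a `_calculate_iou()` helper would compute, though the real method's exact signature is not shown in this plan.

```python
def calculate_iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) rectangles,
    as used against the 0.8 covering-image threshold."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```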
## Phase 2: Service Layer Refactoring

### 2.1 Extract ProcessingOrchestrator
- [ ] 2.1.1 Create `backend/app/services/processing_orchestrator.py`
- [ ] 2.1.2 Extract the flow-orchestration logic from OCRService
- [ ] 2.1.3 Define the `ProcessingPipeline` interface
- [ ] 2.1.4 Implement DirectPipeline and OCRPipeline
- [ ] 2.1.5 Update OCRService to use ProcessingOrchestrator
- [ ] 2.1.6 Ensure existing functionality is unaffected

### 2.2 Extract TableRenderer
- [ ] 2.2.1 Create `backend/app/services/pdf_table_renderer.py`
- [ ] 2.2.2 Extract HTMLTableParser from PDFGeneratorService
- [ ] 2.2.3 Move the table-rendering logic into a standalone class
- [ ] 2.2.4 Support merged-cell rendering
- [ ] 2.2.5 Update PDFGeneratorService to use TableRenderer

### 2.3 Extract FontManager
- [ ] 2.3.1 Create `backend/app/services/pdf_font_manager.py`
- [ ] 2.3.2 Extract the font loading and caching logic
- [ ] 2.3.3 Extract the CJK font support logic
- [ ] 2.3.4 Implement the font fallback mechanism
- [ ] 2.3.5 Update PDFGeneratorService to use FontManager

## Phase 3: Memory Management Simplification

### 3.1 Unified Memory Policy Engine
- [ ] 3.1.1 Create `backend/app/services/memory_policy_engine.py`
- [ ] 3.1.2 Define a unified memory-policy interface
- [ ] 3.1.3 Merge the MemoryManager and MemoryGuard logic
- [ ] 3.1.4 Integrate semaphore management
- [ ] 3.1.5 Simplify the configuration down to 3-4 core items

### 3.2 Migrate Services to the New Memory Engine
- [ ] 3.2.1 Update OCRService to use MemoryPolicyEngine
- [ ] 3.2.2 Update ServicePool to use MemoryPolicyEngine
- [ ] 3.2.3 Remove the old MemoryGuard references
- [ ] 3.2.4 Verify that GPU memory monitoring works correctly

## Phase 4: Frontend State Management

### 4.1 Add TaskStore
- [ ] 4.1.1 Create `frontend/src/store/taskStore.ts`
- [ ] 4.1.2 Define the task state shape (currentTask, tasks, processingStatus)
- [ ] 4.1.3 Implement CRUD operations and state transitions
- [ ] 4.1.4 Add localStorage persistence
- [ ] 4.1.5 Update ProcessingPage to use TaskStore
- [ ] 4.1.6 Update TaskDetailPage to use TaskStore

### 4.2 Consolidate Type Definitions
- [ ] 4.2.1 Review the differences between `api.ts` and `apiV2.ts`
- [ ] 4.2.2 Merge the type definitions into `apiV2.ts`
- [ ] 4.2.3 Remove the duplicate definitions from `api.ts`
- [ ] 4.2.4 Update all import paths
- [ ] 4.2.5 Verify that the TypeScript build has no errors

## Phase 5: Testing and Validation

### 5.1 Regression Tests
- [ ] 5.1.1 Test Direct Track with edit.pdf (ensure no regressions)
- [ ] 5.1.2 Test Direct Track table merging with edit3.pdf
- [ ] 5.1.3 Test OCR Track image re-embedding with edit.pdf
- [ ] 5.1.4 Test OCR Track image re-embedding with edit3.pdf
- [ ] 5.1.5 Verify that all cell_boxes coordinates are correct

### 5.2 Performance Tests
- [ ] 5.2.1 Measure processing time after the refactor
- [ ] 5.2.2 Verify there is no significant increase in memory usage
- [ ] 5.2.3 Verify that GPU utilization is normal