chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal. This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/changes/fix-table-column-alignment/design.md
+++ b/openspec/changes/fix-table-column-alignment/design.md
@@ -0,0 +1,227 @@
+# Design: Table Column Alignment Correction
+
+## Context
+
+PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for:
+- Tables with unclear left borders
+- Cells containing vertical Chinese text
+- Complex merged cells
+
+This design introduces a **post-processing correction layer** that validates and fixes column assignments using geometric coordinates.
+
+## Goals / Non-Goals
+
+**Goals:**
+- Correct column shift errors without modifying the PP-Structure model
+- Use the header row as the authoritative column reference
+- Merge fragmented vertical text into proper cells
+- Maintain backward compatibility with existing pipeline
+
+**Non-Goals:**
+- Training new OCR/structure models
+- Modifying PP-Structure's internal behavior
+- Handling tables without clear headers (future enhancement)
+
+## Architecture
+
+```
+PP-Structure Output
+        │
+        ▼
+┌───────────────────┐
+│ Table Column      │
+│ Corrector         │
+│ (new middleware)  │
+├───────────────────┤
+│ 1. Extract header │
+│    column ranges  │
+│ 2. Validate cells │
+│ 3. Correct col    │
+│    assignments    │
+└───────────────────┘
+        │
+        ▼
+   PDF Generator
+```
+
+## Decisions
+
+### Decision 1: Header-Anchor Algorithm
+
+**Approach:** Use the cells of the first row (row_idx=0) as column anchors.
+
+**Algorithm:**
+```python
+def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
+    """
+    Extract X-coordinate ranges from header row to define column boundaries.
+
+    Returns:
+        List of ColumnAnchor(col_idx, x_min, x_max)
+    """
+    anchors = []
+    for cell in header_cells:
+        anchors.append(ColumnAnchor(
+            col_idx=cell.col_idx,
+            x_min=cell.bbox.x0,
+            x_max=cell.bbox.x1
+        ))
+    return sorted(anchors, key=lambda a: a.x_min)
+
+
+def correct_column(cell: Cell, anchors: List[ColumnAnchor]) -> int:
+    """
+    Find the correct column index based on X-coordinate overlap.
+
+    Strategy:
+    1. Calculate overlap with each column anchor
+    2. If overlap > 50% with a different column, correct it
+    3. If no anchor overlaps, fall back to the nearest column by center point
+    """
+    if not anchors:
+        return cell.col_idx
+
+    cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2
+
+    # Find the anchor with the largest X-overlap
+    best_anchor = None
+    best_overlap = 0
+
+    for anchor in anchors:
+        overlap = calculate_x_overlap(cell.bbox, anchor)
+        if overlap > best_overlap:
+            best_overlap = overlap
+            best_anchor = anchor
+
+    if best_anchor is None:
+        # No overlap at all: pick the anchor whose center is nearest to the cell center
+        best_anchor = min(
+            anchors,
+            key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x),
+        )
+    elif best_overlap <= 0.5:
+        # Overlap is too small to justify a correction
+        return cell.col_idx
+
+    if best_anchor.col_idx != cell.col_idx:
+        logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
+
+    return best_anchor.col_idx
+```
+
+**Why this approach:**
+- Headers are typically the most accurately recognized row
+- X-coordinates are objective measurements, not semantic inference
+- Simple O(n*m) complexity (n cells, m columns)
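
The algorithm above calls `calculate_x_overlap()` without defining it. The following is a minimal sketch of one plausible definition, assuming the overlap is normalized by the cell's own width so that the result can be compared against the 0.5 threshold. The `BBox` and `ColumnAnchor` dataclasses and the sample coordinates are illustrative assumptions, not the project's actual types.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    # Hypothetical stand-in for the project's bounding-box type
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class ColumnAnchor:
    # Mirrors the ColumnAnchor(col_idx, x_min, x_max) shape described above
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(bbox: BBox, anchor: ColumnAnchor) -> float:
    """Return the fraction of the cell's width that lies inside the anchor's X-range."""
    cell_width = bbox.x1 - bbox.x0
    if cell_width <= 0:
        return 0.0
    overlap = min(bbox.x1, anchor.x_max) - max(bbox.x0, anchor.x_min)
    # A negative value means the ranges do not intersect at all
    return max(0.0, overlap) / cell_width


# Hypothetical two-column header and a cell that straddles the boundary
anchors = [ColumnAnchor(0, 0, 100), ColumnAnchor(1, 100, 200)]
cell = BBox(80, 0, 180, 20)
ratios = [calculate_x_overlap(cell, a) for a in anchors]
```

Here 20% of the cell falls in column 0 and 80% in column 1, so only column 1 clears the 0.5 threshold. Normalizing by the anchor width instead would penalize narrow cells in wide columns, which is why this sketch divides by the cell width.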
+
+### Decision 2: Vertical Fragment Merging
+
+**Detection criteria for vertical text fragments:**
+1. Width much smaller than height (width/height ratio < 0.3)
+2. Located in leftmost 15% of table
+3. X-center deviation < 10px between consecutive blocks
+4. Y-gap < 20px (adjacent in vertical direction)
+
+**Merge strategy:**
+```python
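+def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
+    """
+    Merge vertically stacked narrow text blocks into single blocks.
+
+    Returns only the merged left-margin blocks; blocks that do not meet the
+    candidate criteria are not included and must be re-integrated by the caller.
+    """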
+    # Filter candidates: narrow blocks in left margin
+    left_boundary = table_bbox.x0 + (table_bbox.width * 0.15)
+    candidates = [b for b in blocks
+                  if b.width < b.height * 0.3
+                  and b.center_x < left_boundary]
+
+    # Sort by Y position
+    candidates.sort(key=lambda b: b.y0)
+
+    # Merge adjacent blocks
+    merged = []
+    current_group = []
+
+    for block in candidates:
+        if not current_group:
+            current_group.append(block)
+        elif should_merge(current_group[-1], block):
+            current_group.append(block)
+        else:
+            merged.append(merge_group(current_group))
+            current_group = [block]
+
+    if current_group:
+        merged.append(merge_group(current_group))
+
+    return merged
+```
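
The merge strategy relies on `should_merge()` and `merge_group()`, which are not shown. Below is a hedged sketch of both, using the 10px X-center and 20px Y-gap criteria listed above. The `TextBlock` fields, the constant names, and the block widths in the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

MAX_X_CENTER_DEVIATION = 10.0  # px, detection criterion 3 above
MAX_Y_GAP = 20.0               # px, detection criterion 4 above


@dataclass
class TextBlock:
    # Hypothetical minimal shape; the real TextBlock may carry more fields
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center_x(self) -> float:
        return (self.x0 + self.x1) / 2


def should_merge(prev: TextBlock, nxt: TextBlock) -> bool:
    """True if nxt sits directly below prev in the same narrow column."""
    x_aligned = abs(prev.center_x - nxt.center_x) < MAX_X_CENTER_DEVIATION
    y_gap = nxt.y0 - prev.y1  # assumes blocks are sorted by y0
    return x_aligned and y_gap < MAX_Y_GAP


def merge_group(group: List[TextBlock]) -> TextBlock:
    """Concatenate text top to bottom and take the union bounding box."""
    return TextBlock(
        text="".join(b.text for b in group),
        x0=min(b.x0 for b in group),
        y0=min(b.y0 for b in group),
        x1=max(b.x1 for b in group),
        y1=max(b.y1 for b in group),
    )


# Two fragments with X-centers 100 and 102 and touching Y-ranges
a = TextBlock("报价内", 90, 100, 110, 200)
b = TextBlock("容--", 92, 200, 112, 300)
merged = merge_group([a, b]) if should_merge(a, b) else None
```

Joining the text in top-to-bottom order preserves the reading order of vertical text, provided the group was sorted by `y0` beforehand as in `merge_vertical_fragments()`.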
+
+### Decision 3: Data Sources
+
+**Primary source:** `cell_boxes` from PP-Structure
+- Contains accurate geometric coordinates for each detected cell
+- Independent of HTML structure recognition
+
+**Secondary source:** HTML content with row/col attributes
+- Contains text content and structure
+- May have incorrect col assignments (the problem we're fixing)
+
+**Correlation:** Match HTML cells to cell_boxes using IoU (Intersection over Union):
+```python
+def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]:
+    """Find the cell_box that best matches this HTML cell's position."""
+    best_iou = 0
+    best_box = None
+
+    for box in cell_boxes:
+        iou = calculate_iou(html_cell.inferred_bbox, box)
+        if iou > best_iou:
+            best_iou = iou
+            best_box = box
+
+    return best_box if best_iou > 0.3 else None
+```
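
The matching step depends on `calculate_iou()`. A minimal sketch follows, assuming boxes are given as `[x0, y0, x1, y1]` sequences as described for `cell_boxes` under Integration Points; the real helper may accept a `BBox` object instead.

```python
from typing import Sequence


def calculate_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over Union of two [x0, y0, x1, y1] boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # boxes do not intersect
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


iou_same = calculate_iou([0, 0, 10, 10], [0, 0, 10, 10])      # identical boxes
iou_half = calculate_iou([0, 0, 10, 10], [5, 0, 15, 10])      # shifted by half a width
iou_none = calculate_iou([0, 0, 10, 10], [20, 20, 30, 30])    # disjoint boxes
```

A box shifted by half its width yields an IoU of 1/3, which only just clears the 0.3 acceptance threshold used in `match_html_cell_to_cellbox()`.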
+
+## Configuration
+
+```python
+# config.py additions
+table_column_correction_enabled: bool = Field(
+    default=True,
+    description="Enable header-anchor column correction"
+)
+table_column_correction_threshold: float = Field(
+    default=0.5,
+    description="Minimum X-overlap ratio to trigger column correction"
+)
+vertical_fragment_merge_enabled: bool = Field(
+    default=True,
+    description="Enable vertical text fragment merging"
+)
+vertical_fragment_aspect_ratio: float = Field(
+    default=0.3,
+    description="Max width/height ratio to consider as vertical text"
+)
+```
+
+## Risks / Trade-offs
+
+| Risk | Mitigation |
+|------|------------|
+| Headers themselves misaligned | Fall back to original column assignments |
+| Multi-row headers | Support colspan detection in header extraction |
+| Tables without headers | Skip correction, use original structure |
+| Performance overhead | O(n*m) is negligible for typical table sizes |
+
+## Integration Points
+
+1. **Input:** PP-Structure's `table_res` containing:
+   - `cell_boxes`: List of [x0, y0, x1, y1] coordinates
+   - `html`: Table HTML with row/col attributes
+
+2. **Output:** Corrected table structure with:
+   - Updated col indices in HTML cells
+   - Merged vertical text blocks
+   - Diagnostic logs for corrections made
+
+3. **Trigger location:** After PP-Structure table recognition, before PDF generation
+   - File: `pdf_generator_service.py`
+   - Method: `draw_table_region()` or new preprocessing step
+
+## Open Questions
+
+1. **Q:** How should tables be handled when the header row itself is misaligned?
+   **A:** Could add a secondary validation using cell_boxes grid inference, but start simple.
+
+2. **Q:** Should corrections be logged for user review?
+   **A:** Yes, add detailed logging with before/after column indices.
--- a/openspec/changes/fix-table-column-alignment/proposal.md
+++ b/openspec/changes/fix-table-column-alignment/proposal.md
@@ -0,0 +1,56 @@
+# Change: Fix Table Column Alignment with Header-Anchor Correction
+
+## Why
+
+PP-Structure's table structure recognition frequently outputs cells with incorrect column indices, causing "column shift" where content appears in the wrong column. This happens because:
+
+1. **Semantic over Geometric**: The model infers row/col from semantic patterns rather than physical coordinates
+2. **Vertical text fragmentation**: Chinese vertical text (e.g., "报价内容", "quotation content") gets split into fragments
+3. **Missing left boundary**: When the table's left border is unclear, cells shift left incorrectly
+
+The result: A cell with X-coordinate 213 gets assigned to column 0 (range 96-162) instead of column 1 (range 204-313).
+
+## What Changes
+
+- **Add Header-Anchor Alignment**: Use the first row (header) X-coordinates as column reference points
+- **Add Coordinate-Based Column Correction**: Validate and correct cell column assignments based on X-coordinate overlap with header columns
+- **Add Vertical Fragment Merging**: Detect and merge vertically stacked narrow text blocks that represent vertical text
+- **Add Configuration Options**: Enable/disable correction features independently
+
+## Impact
+
+- Affected specs: `document-processing`
+- Affected code:
+  - `backend/app/services/table_column_corrector.py` (new)
+  - `backend/app/services/pdf_generator_service.py`
+  - `backend/app/core/config.py`
+
+## Problem Analysis
+
+### Example: scan.pdf Table 7
+
+**Raw PP-Structure Output:**
+```
+Row 5: "3、適應產品..." at X=213
+       Model says: col=0
+
+Header Row 0:
+  - Column 0 (序號): X range [96, 162]
+  - Column 1 (產品名稱): X range [204, 313]
+```
+
+**Problem:** X=213 lies well outside column 0's range [96, 162] but inside column 1's range [204, 313].
+
+**Solution:** Force-correct col=0 → col=1 based on X-coordinate alignment with header.
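
The Table 7 case can be checked with a few lines of Python. This is a simplified, point-based sketch: it tests a single X-coordinate against the header ranges rather than computing an overlap ratio, and the `column_for_x` helper and its nearest-edge fallback are illustrative assumptions.

```python
from typing import List, Tuple

# (col_idx, x_min, x_max) taken from Header Row 0 above
HEADER_RANGES: List[Tuple[int, float, float]] = [(0, 96, 162), (1, 204, 313)]


def column_for_x(x: float, ranges: List[Tuple[int, float, float]]) -> int:
    """Return the column whose X-range contains x, else the nearest one."""
    for col_idx, x_min, x_max in ranges:
        if x_min <= x <= x_max:
            return col_idx
    # x falls in a gap: pick the range whose closest edge is nearest to x
    return min(ranges, key=lambda r: min(abs(x - r[1]), abs(x - r[2])))[0]


model_col = 0                                  # what the model reported for Row 5
corrected_col = column_for_x(213, HEADER_RANGES)
```

With these ranges `corrected_col` is 1, which differs from `model_col` and so would trigger the correction.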
+
+### Vertical Text Issue
+
+**Raw OCR:**
+```
+Block A: "报价内" at X≈100, Y=[100, 200]
+Block B: "容--"   at X≈102, Y=[200, 300]
+```
+
+**Problem:** These should be one cell spanning multiple rows, but appear as separate fragments.
+
+**Solution:** Merge vertically aligned narrow blocks before column assignment.
--- a/openspec/changes/fix-table-column-alignment/specs/document-processing/spec.md
+++ b/openspec/changes/fix-table-column-alignment/specs/document-processing/spec.md
@@ -0,0 +1,59 @@
+## ADDED Requirements
+
+### Requirement: Table Column Alignment Correction
+The system SHALL correct table cell column assignments using header-anchor alignment when PP-Structure outputs incorrect column indices.
+
+#### Scenario: Correct column shift using header anchors
+- **WHEN** processing a table with cell_boxes and HTML content
+- **THEN** the system SHALL extract header row (row_idx=0) column X-coordinate ranges
+- **AND** validate each cell's column assignment against header X-ranges
+- **AND** correct column index if cell X-overlap with assigned column is < 50%
+- **AND** assign cell to column with highest X-overlap
+
+#### Scenario: Handle tables without headers
+- **WHEN** processing a table without a clear header row
+- **THEN** the system SHALL skip column correction
+- **AND** use original PP-Structure column assignments
+- **AND** log that header-anchor correction was skipped
+
+#### Scenario: Log column corrections
+- **WHEN** a cell's column index is corrected
+- **THEN** the system SHALL log original and corrected column indices
+- **AND** include cell content snippet for debugging
+- **AND** record total corrections per table
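
The logging scenario could be satisfied along the following lines. This is a sketch only: the logger name, the `log_corrections` helper, the tuple layout, and the 20-character snippet length are all assumptions rather than the project's actual interface.

```python
import logging
from typing import List, Tuple

# Hypothetical logger name
logger = logging.getLogger("table_column_corrector")


def log_corrections(
    table_id: str,
    corrections: List[Tuple[str, int, int]],
    snippet_len: int = 20,
) -> int:
    """Log each (text, old_col, new_col) correction and return the per-table total."""
    for text, old_col, new_col in corrections:
        logger.info(
            "Table %s: col %d -> %d for cell %r",
            table_id, old_col, new_col, text[:snippet_len],
        )
    logger.info("Table %s: %d column correction(s)", table_id, len(corrections))
    return len(corrections)


total = log_corrections("table-7", [("3、適應產品...", 0, 1)])
```

Returning the count lets the caller record the total alongside other per-table diagnostics without parsing log output.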
+
+### Requirement: Vertical Text Fragment Merging
+The system SHALL detect and merge vertically fragmented Chinese text blocks that represent single cells spanning multiple rows.
+
+#### Scenario: Detect vertical text fragments
+- **WHEN** processing table text regions
+- **THEN** the system SHALL identify narrow text blocks (width/height ratio < 0.3)
+- **AND** filter blocks in leftmost 15% of table area
+- **AND** group vertically adjacent blocks with X-center deviation < 10px
+
+#### Scenario: Merge fragmented vertical text
+- **WHEN** vertical text fragments are detected
+- **THEN** the system SHALL merge adjacent fragments into single text blocks
+- **AND** combine text content preserving reading order
+- **AND** calculate merged bounding box spanning all fragments
+- **AND** treat merged block as single cell for column assignment
+
+#### Scenario: Preserve non-vertical text
+- **WHEN** text blocks do not meet vertical fragment criteria
+- **THEN** the system SHALL preserve original text block boundaries
+- **AND** process normally without merging
+
+## MODIFIED Requirements
+
+### Requirement: Extract table structure
+The system SHALL extract cell content and boundaries from PP-StructureV3 tables, with post-processing correction for column alignment errors.
+
+#### Scenario: Extract table structure with correction
+- **WHEN** PP-StructureV3 identifies a table
+- **THEN** the system SHALL extract cell content and boundaries
+- **AND** validate cell_boxes coordinates against page boundaries
+- **AND** apply header-anchor column correction when enabled
+- **AND** merge vertical text fragments when enabled
+- **AND** apply fallback detection for invalid coordinates
+- **AND** preserve table HTML for structure
+- **AND** extract plain text for translation
--- a/openspec/changes/fix-table-column-alignment/tasks.md
+++ b/openspec/changes/fix-table-column-alignment/tasks.md
@@ -0,0 +1,59 @@
+## 1. Core Algorithm Implementation
+
+### 1.1 Table Column Corrector Module
+- [x] 1.1.1 Create `table_column_corrector.py` service file
+- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
+- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
+- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
+- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
+- [x] 1.1.6 Implement `correct_table_columns()` main entry point
+
+### 1.2 HTML Cell Extraction
+- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
+- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
+- [x] 1.2.3 Handle colspan/rowspan in header detection
+
+### 1.3 Vertical Fragment Merging
+- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
+- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
+- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
+- [x] 1.3.4 Integrate merged blocks back into table structure
+
+## 2. Configuration
+
+### 2.1 Settings
+- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
+- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
+- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
+- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`
+
+## 3. Integration
+
+### 3.1 Pipeline Integration
+- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
+- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
+- [x] 3.1.3 Add diagnostic logging for corrections made
+
+### 3.2 Error Handling
+- [x] 3.2.1 Handle tables without headers gracefully
+- [x] 3.2.2 Handle empty/malformed cell_boxes
+- [x] 3.2.3 Fallback to original structure on correction failure
+
+## 4. Testing
+
+### 4.1 Unit Tests
+- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
+- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
+- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
+- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers
+
+### 4.2 Integration Tests
+- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
+- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
+- [ ] 4.2.3 Visual comparison of corrected vs original output
+
+## 5. Documentation
+
+- [x] 5.1 Add inline code comments explaining correction algorithm
+- [x] 5.2 Update spec with new table column correction requirement
+- [x] 5.3 Add logging messages for debugging