chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal. This includes all pending changes and new features. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
227
openspec/changes/fix-table-column-alignment/design.md
Normal file
@@ -0,0 +1,227 @@
# Design: Table Column Alignment Correction

## Context

PP-Structure v3's table structure recognition model outputs HTML with row/col attributes inferred from visual patterns. However, the model frequently assigns incorrect column indices, especially for:
- Tables with unclear left borders
- Cells containing vertical Chinese text
- Complex merged cells

This design introduces a **post-processing correction layer** that validates and fixes column assignments using geometric coordinates.

## Goals / Non-Goals

**Goals:**
- Correct column shift errors without modifying the PP-Structure model
- Use the header row as the authoritative column reference
- Merge fragmented vertical text into proper cells
- Maintain backward compatibility with the existing pipeline

**Non-Goals:**
- Training new OCR/structure models
- Modifying PP-Structure's internal behavior
- Handling tables without clear headers (future enhancement)

## Architecture

```
PP-Structure Output
       │
       ▼
┌───────────────────┐
│ Table Column      │
│ Corrector         │
│ (new middleware)  │
├───────────────────┤
│ 1. Extract header │
│    column ranges  │
│ 2. Validate cells │
│ 3. Correct col    │
│    assignments    │
└───────────────────┘
       │
       ▼
PDF Generator
```

## Decisions

### Decision 1: Header-Anchor Algorithm

**Approach:** Use the cells of the first row (row_idx=0) as column anchors.

**Algorithm:**
```python
import logging
from typing import List

logger = logging.getLogger(__name__)

# Cell, ColumnAnchor and calculate_x_overlap are assumed to be defined
# alongside these functions.


def build_column_anchors(header_cells: List[Cell]) -> List[ColumnAnchor]:
    """
    Extract X-coordinate ranges from header row to define column boundaries.

    Returns:
        List of ColumnAnchor(col_idx, x_min, x_max), sorted left to right
    """
    anchors = []
    for cell in header_cells:
        anchors.append(ColumnAnchor(
            col_idx=cell.col_idx,
            x_min=cell.bbox.x0,
            x_max=cell.bbox.x1
        ))
    return sorted(anchors, key=lambda a: a.x_min)


def correct_column(cell: Cell, anchors: List[ColumnAnchor], threshold: float = 0.5) -> int:
    """
    Find the correct column index based on X-coordinate overlap.

    Strategy:
    1. Calculate overlap with each column anchor
    2. If overlap > threshold (default 50%) with a different column, correct it
    3. If no overlap, find nearest column by center point
    """
    cell_center_x = (cell.bbox.x0 + cell.bbox.x1) / 2

    # Find best matching anchor
    best_anchor = None
    best_overlap = 0

    for anchor in anchors:
        overlap = calculate_x_overlap(cell.bbox, anchor)
        if overlap > best_overlap:
            best_overlap = overlap
            best_anchor = anchor

    # If significant overlap with different column, correct
    if best_anchor and best_overlap > threshold:
        if best_anchor.col_idx != cell.col_idx:
            logger.info(f"Correcting cell col {cell.col_idx} -> {best_anchor.col_idx}")
        return best_anchor.col_idx

    # Strategy step 3: no overlap with any anchor, so use the nearest column center
    if best_anchor is None and anchors:
        nearest = min(anchors, key=lambda a: abs((a.x_min + a.x_max) / 2 - cell_center_x))
        return nearest.col_idx

    return cell.col_idx
```

**Why this approach:**
- Headers are typically the most accurately recognized row
- X-coordinates are objective measurements, not semantic inference
- Simple O(n*m) complexity (n cells, m columns)

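
`correct_column` relies on a `calculate_x_overlap` helper that this design does not spell out. The following is a minimal sketch, assuming the overlap is measured as the fraction of the cell's width that falls inside the anchor's X-range, so that the 0.5 threshold reads as "more than half of the cell sits under this header". The `BBox` and `ColumnAnchor` dataclasses here are illustrative stand-ins, not the project's actual types.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Illustrative stand-in for the project's bounding box type."""
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class ColumnAnchor:
    """Header column X-range, as returned by build_column_anchors()."""
    col_idx: int
    x_min: float
    x_max: float


def calculate_x_overlap(bbox: BBox, anchor: ColumnAnchor) -> float:
    """Return the fraction (0.0 to 1.0) of the cell's width covered by the anchor."""
    cell_width = bbox.x1 - bbox.x0
    if cell_width <= 0:
        return 0.0  # degenerate box: treat as no overlap
    overlap = min(bbox.x1, anchor.x_max) - max(bbox.x0, anchor.x_min)
    return max(0.0, overlap) / cell_width
```

Normalizing by the cell's width rather than the anchor's is a design choice: a narrow cell that sits entirely under a wide header still scores 1.0. A cell that legitimately spans several columns would score below the threshold for each of them, so merged cells may need separate handling.
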
### Decision 2: Vertical Fragment Merging

**Detection criteria for vertical text fragments:**
1. Width much smaller than height (width/height ratio < 0.3)
2. Located in the leftmost 15% of the table
3. X-center deviation < 10px between consecutive blocks
4. Y-gap < 20px (adjacent in the vertical direction)

**Merge strategy:**
```python
from typing import List


def merge_vertical_fragments(blocks: List[TextBlock], table_bbox: BBox) -> List[TextBlock]:
    """
    Merge vertically stacked narrow text blocks into single blocks.
    """
    # Filter candidates: narrow blocks in left margin
    left_boundary = table_bbox.x0 + (table_bbox.width * 0.15)
    candidates = [b for b in blocks
                  if b.width < b.height * 0.3
                  and b.center_x < left_boundary]

    # Blocks that are not candidates pass through unchanged
    candidate_ids = {id(b) for b in candidates}
    others = [b for b in blocks if id(b) not in candidate_ids]

    # Sort by Y position
    candidates.sort(key=lambda b: b.y0)

    # Merge adjacent blocks
    merged = []
    current_group = []

    for block in candidates:
        if not current_group:
            current_group.append(block)
        elif should_merge(current_group[-1], block):
            current_group.append(block)
        else:
            merged.append(merge_group(current_group))
            current_group = [block]

    if current_group:
        merged.append(merge_group(current_group))

    return others + merged
```

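
The merge loop calls `should_merge` and `merge_group`, which are not defined in this document (the task list names the adjacency check `should_merge_blocks()`). A minimal sketch of both, assuming criteria 3 and 4 above are applied as fixed pixel limits and that a merged block simply concatenates text top to bottom. The `TextBlock` dataclass is a hypothetical stand-in for the project's type.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBlock:
    """Illustrative stand-in for the project's text block type."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center_x(self) -> float:
        return (self.x0 + self.x1) / 2


def should_merge(prev: TextBlock, curr: TextBlock,
                 max_x_dev: float = 10.0, max_y_gap: float = 20.0) -> bool:
    """Criteria 3 and 4: aligned X-centers and a small vertical gap.

    Assumes prev comes before curr in top-to-bottom order. Overlapping
    blocks produce a negative gap and therefore also count as adjacent.
    """
    x_aligned = abs(prev.center_x - curr.center_x) < max_x_dev
    y_adjacent = (curr.y0 - prev.y1) < max_y_gap
    return x_aligned and y_adjacent


def merge_group(group: List[TextBlock]) -> TextBlock:
    """Concatenate text in group order and take the union bounding box."""
    return TextBlock(
        text="".join(b.text for b in group),
        x0=min(b.x0 for b in group),
        y0=min(b.y0 for b in group),
        x1=max(b.x1 for b in group),
        y1=max(b.y1 for b in group),
    )
```

Because the caller sorts candidates by `y0` before grouping, joining the text in group order preserves the top-to-bottom reading order of vertical text.
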
### Decision 3: Data Sources

**Primary source:** `cell_boxes` from PP-Structure
- Contains accurate geometric coordinates for each detected cell
- Independent of HTML structure recognition

**Secondary source:** HTML content with row/col attributes
- Contains text content and structure
- May have incorrect col assignments (the problem we're fixing)

**Correlation:** Match HTML cells to cell_boxes using IoU (Intersection over Union):
```python
from typing import List, Optional


def match_html_cell_to_cellbox(html_cell: HtmlCell, cell_boxes: List[BBox]) -> Optional[BBox]:
    """Find the cell_box that best matches this HTML cell's position."""
    best_iou = 0
    best_box = None

    for box in cell_boxes:
        iou = calculate_iou(html_cell.inferred_bbox, box)
        if iou > best_iou:
            best_iou = iou
            best_box = box

    # Reject weak matches: below 0.3 IoU the pairing is treated as unreliable
    return best_box if best_iou > 0.3 else None
```

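
`calculate_iou` is used above without a definition. For reference, a minimal sketch of the standard Intersection over Union computation for axis-aligned boxes, assuming the same `[x0, y0, x1, y1]` convention as `cell_boxes`. The `BBox` dataclass is an illustrative stand-in.

```python
from dataclasses import dataclass


@dataclass
class BBox:
    """Illustrative axis-aligned box using the [x0, y0, x1, y1] convention."""
    x0: float
    y0: float
    x1: float
    y1: float


def calculate_iou(a: BBox, b: BBox) -> float:
    """Intersection area divided by union area; 0.0 when the boxes do not overlap."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    return intersection / (area_a + area_b - intersection)
```

Two boxes of equal size that overlap by half share an IoU of 1/3, which is just above the 0.3 cut-off used by `match_html_cell_to_cellbox`.
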
## Configuration

```python
# config.py additions
table_column_correction_enabled: bool = Field(
    default=True,
    description="Enable header-anchor column correction"
)
table_column_correction_threshold: float = Field(
    default=0.5,
    description="Minimum X-overlap ratio to trigger column correction"
)
vertical_fragment_merge_enabled: bool = Field(
    default=True,
    description="Enable vertical text fragment merging"
)
vertical_fragment_aspect_ratio: float = Field(
    default=0.3,
    description="Max width/height ratio to consider as vertical text"
)
```

## Risks / Trade-offs

| Risk | Mitigation |
|------|------------|
| Headers themselves misaligned | Fall back to original column assignments |
| Multi-row headers | Support colspan detection in header extraction |
| Tables without headers | Skip correction, use original structure |
| Performance overhead | O(n*m) is negligible for typical table sizes |

## Integration Points

1. **Input:** PP-Structure's `table_res` containing:
   - `cell_boxes`: List of [x0, y0, x1, y1] coordinates
   - `html`: Table HTML with row/col attributes

2. **Output:** Corrected table structure with:
   - Updated col indices in HTML cells
   - Merged vertical text blocks
   - Diagnostic logs for corrections made

3. **Trigger location:** After PP-Structure table recognition, before PDF generation
   - File: `pdf_generator_service.py`
   - Method: `draw_table_region()` or new preprocessing step

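
One way the "new preprocessing step" could be wired in is as a thin wrapper that runs the corrector and falls back to the raw structure whenever correction is disabled, the input is incomplete, or the corrector raises. This is a sketch under assumptions: `preprocess_table` and its parameters are hypothetical names, and `table_res` is assumed to be a dict with the `cell_boxes` and `html` keys listed above.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)

TableRes = Dict[str, Any]  # assumed shape: {"cell_boxes": [...], "html": "..."}


def preprocess_table(table_res: TableRes,
                     corrector: Callable[[TableRes], TableRes],
                     enabled: bool = True) -> TableRes:
    """Run the corrector before rendering; return the input unchanged on any problem."""
    if not enabled:
        return table_res
    if not table_res.get("cell_boxes") or not table_res.get("html"):
        logger.info("Column correction skipped: missing cell_boxes or html")
        return table_res
    try:
        return corrector(table_res)
    except Exception:
        # Correction is best-effort; rendering must still work with the raw structure.
        logger.exception("Column correction failed; using original structure")
        return table_res
```

Passing the corrector in as a callable keeps the wrapper independent of the correction module, so the rendering path can be tested with a stub.
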
## Open Questions

1. **Q:** How to handle tables where the header row itself is misaligned?
   **A:** Could add a secondary validation using cell_boxes grid inference, but start simple.

2. **Q:** Should corrections be logged for user review?
   **A:** Yes, add detailed logging with before/after column indices.
56
openspec/changes/fix-table-column-alignment/proposal.md
Normal file
@@ -0,0 +1,56 @@
# Change: Fix Table Column Alignment with Header-Anchor Correction

## Why

PP-Structure's table structure recognition frequently outputs cells with incorrect column indices, causing "column shift" where content appears in the wrong column. This happens because:

1. **Semantic over Geometric**: The model infers row/col from semantic patterns rather than physical coordinates
2. **Vertical text fragmentation**: Chinese vertical text (e.g., "报价内容") gets split into fragments
3. **Missing left boundary**: When the table's left border is unclear, cells shift left incorrectly

The result: A cell with X-coordinate 213 gets assigned to column 0 (range 96-162) instead of column 1 (range 204-313).

## What Changes

- **Add Header-Anchor Alignment**: Use the first row (header) X-coordinates as column reference points
- **Add Coordinate-Based Column Correction**: Validate and correct cell column assignments based on X-coordinate overlap with header columns
- **Add Vertical Fragment Merging**: Detect and merge vertically stacked narrow text blocks that represent vertical text
- **Add Configuration Options**: Enable/disable correction features independently

## Impact

- Affected specs: `document-processing`
- Affected code:
  - `backend/app/services/table_column_corrector.py` (new)
  - `backend/app/services/pdf_generator_service.py`
  - `backend/app/core/config.py`

## Problem Analysis

### Example: scan.pdf Table 7

**Raw PP-Structure Output:**
```
Row 5: "3、適應產品..." at X=213
       Model says: col=0

Header Row 0:
  - Column 0 (序號): X range [96, 162]
  - Column 1 (產品名稱): X range [204, 313]
```

**Problem:** X=213 lies well outside column 0's range (max 162) but inside column 1's range [204, 313].

**Solution:** Force-correct col=0 → col=1 based on X-coordinate alignment with header.

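
The Table 7 numbers can be checked with a few lines of code. This sketch uses a simplified point-in-range test on the single X value given in the example; `column_for_x` is a hypothetical helper for illustration only, whereas the proposed correction works on X-range overlap.

```python
# Header X-ranges for scan.pdf Table 7, taken from the example above.
header_ranges = {0: (96, 162), 1: (204, 313)}

cell_x = 213    # X-coordinate of the row 5 cell
model_col = 0   # column index reported by PP-Structure


def column_for_x(x, ranges):
    """Return the column whose header X-range contains x, or None if none does."""
    for col_idx, (x_min, x_max) in ranges.items():
        if x_min <= x <= x_max:
            return col_idx
    return None


corrected_col = column_for_x(cell_x, header_ranges)
print(f"model col={model_col}, header-anchored col={corrected_col}")
# prints: model col=0, header-anchored col=1
```
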
### Vertical Text Issue

**Raw OCR:**
```
Block A: "报价内" at X≈100, Y=[100, 200]
Block B: "容--" at X≈102, Y=[200, 300]
```

**Problem:** These should be one cell spanning multiple rows, but appear as separate fragments.

**Solution:** Merge vertically aligned narrow blocks before structure recognition.
@@ -0,0 +1,59 @@
## ADDED Requirements

### Requirement: Table Column Alignment Correction
The system SHALL correct table cell column assignments using header-anchor alignment when PP-Structure outputs incorrect column indices.

#### Scenario: Correct column shift using header anchors
- **WHEN** processing a table with cell_boxes and HTML content
- **THEN** the system SHALL extract header row (row_idx=0) column X-coordinate ranges
- **AND** validate each cell's column assignment against header X-ranges
- **AND** correct column index if cell X-overlap with assigned column is < 50%
- **AND** assign cell to column with highest X-overlap

#### Scenario: Handle tables without headers
- **WHEN** processing a table without a clear header row
- **THEN** the system SHALL skip column correction
- **AND** use original PP-Structure column assignments
- **AND** log that header-anchor correction was skipped

#### Scenario: Log column corrections
- **WHEN** a cell's column index is corrected
- **THEN** the system SHALL log original and corrected column indices
- **AND** include cell content snippet for debugging
- **AND** record total corrections per table

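
The three scenarios above can be read together as one per-table pass. The sketch below is an illustration under assumptions, not the shipped implementation: cells are plain dicts with hypothetical keys `text`, `col`, `x0`, `x1`, header columns are a dict of `col_idx -> (x_min, x_max)`, and overlap is measured as a fraction of the cell's width. It applies the rule as worded here (correct when overlap with the assigned column is below 50%).

```python
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger(__name__)


def x_overlap_ratio(cell: Tuple[float, float], col: Tuple[float, float]) -> float:
    """Fraction of the cell's X-extent that lies inside the column's X-range."""
    width = cell[1] - cell[0]
    if width <= 0:
        return 0.0
    return max(0.0, min(cell[1], col[1]) - max(cell[0], col[0])) / width


def correct_table(cells: List[dict], columns: Dict[int, Tuple[float, float]],
                  threshold: float = 0.5) -> int:
    """Apply the scenario rules to each cell in place; return the correction count."""
    if not columns:
        logger.info("Header-anchor correction skipped: no header columns")
        return 0
    corrections = 0
    for cell in cells:
        span = (cell["x0"], cell["x1"])
        assigned = columns.get(cell["col"])
        assigned_overlap = x_overlap_ratio(span, assigned) if assigned is not None else 0.0
        if assigned_overlap >= threshold:
            continue  # current assignment is consistent with the header
        best = max(columns, key=lambda c: x_overlap_ratio(span, columns[c]))
        if best != cell["col"] and x_overlap_ratio(span, columns[best]) > 0:
            logger.info("Corrected col %d -> %d for %r",
                        cell["col"], best, cell["text"][:20])
            cell["col"] = best
            corrections += 1
    logger.info("Table corrections: %d", corrections)
    return corrections
```

Cells that overlap no header column at all are left untouched in this sketch; how to treat them is a separate decision.
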
### Requirement: Vertical Text Fragment Merging
The system SHALL detect and merge vertically fragmented Chinese text blocks that represent single cells spanning multiple rows.

#### Scenario: Detect vertical text fragments
- **WHEN** processing table text regions
- **THEN** the system SHALL identify narrow text blocks (width/height ratio < 0.3)
- **AND** filter blocks in leftmost 15% of table area
- **AND** group vertically adjacent blocks with X-center deviation < 10px

#### Scenario: Merge fragmented vertical text
- **WHEN** vertical text fragments are detected
- **THEN** the system SHALL merge adjacent fragments into single text blocks
- **AND** combine text content preserving reading order
- **AND** calculate merged bounding box spanning all fragments
- **AND** treat merged block as single cell for column assignment

#### Scenario: Preserve non-vertical text
- **WHEN** text blocks do not meet vertical fragment criteria
- **THEN** the system SHALL preserve original text block boundaries
- **AND** process normally without merging

## MODIFIED Requirements

### Requirement: Extract table structure
The system SHALL extract cell content and boundaries from PP-StructureV3 tables, with post-processing correction for column alignment errors.

#### Scenario: Extract table structure with correction
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply header-anchor column correction when enabled
- **AND** merge vertical text fragments when enabled
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
59
openspec/changes/fix-table-column-alignment/tasks.md
Normal file
@@ -0,0 +1,59 @@
## 1. Core Algorithm Implementation

### 1.1 Table Column Corrector Module
- [x] 1.1.1 Create `table_column_corrector.py` service file
- [x] 1.1.2 Implement `ColumnAnchor` dataclass for header column ranges
- [x] 1.1.3 Implement `build_column_anchors()` to extract header column X-ranges
- [x] 1.1.4 Implement `calculate_x_overlap()` utility function
- [x] 1.1.5 Implement `correct_cell_column()` for single cell correction
- [x] 1.1.6 Implement `correct_table_columns()` main entry point

### 1.2 HTML Cell Extraction
- [x] 1.2.1 Implement `parse_table_html_with_positions()` to extract cells with row/col
- [x] 1.2.2 Implement cell-to-cellbox matching using IoU
- [x] 1.2.3 Handle colspan/rowspan in header detection

### 1.3 Vertical Fragment Merging
- [x] 1.3.1 Implement `detect_vertical_fragments()` to find narrow text blocks
- [x] 1.3.2 Implement `should_merge_blocks()` adjacency check
- [x] 1.3.3 Implement `merge_vertical_fragments()` main function
- [x] 1.3.4 Integrate merged blocks back into table structure

## 2. Configuration

### 2.1 Settings
- [x] 2.1.1 Add `table_column_correction_enabled: bool = True`
- [x] 2.1.2 Add `table_column_correction_threshold: float = 0.5`
- [x] 2.1.3 Add `vertical_fragment_merge_enabled: bool = True`
- [x] 2.1.4 Add `vertical_fragment_aspect_ratio: float = 0.3`

## 3. Integration

### 3.1 Pipeline Integration
- [x] 3.1.1 Add correction step in `pdf_generator_service.py` before table rendering
- [x] 3.1.2 Pass corrected HTML to existing table rendering logic
- [x] 3.1.3 Add diagnostic logging for corrections made

### 3.2 Error Handling
- [x] 3.2.1 Handle tables without headers gracefully
- [x] 3.2.2 Handle empty/malformed cell_boxes
- [x] 3.2.3 Fallback to original structure on correction failure

## 4. Testing

### 4.1 Unit Tests
- [ ] 4.1.1 Test `build_column_anchors()` with various header configurations
- [ ] 4.1.2 Test `correct_cell_column()` with known column shift cases
- [ ] 4.1.3 Test `merge_vertical_fragments()` with vertical text samples
- [ ] 4.1.4 Test edge cases: empty tables, single column, no headers

### 4.2 Integration Tests
- [ ] 4.2.1 Test with `scan.pdf` Table 7 (the problematic case)
- [ ] 4.2.2 Test with tables that have correct alignment (no regression)
- [ ] 4.2.3 Visual comparison of corrected vs original output

## 5. Documentation

- [x] 5.1 Add inline code comments explaining correction algorithm
- [x] 5.2 Update spec with new table column correction requirement
- [x] 5.3 Add logging messages for debugging