OCR/openspec/changes/fix-table-column-alignment/specs/document-processing/spec.md
egg 940a406dce chore: backup before code cleanup
2025-12-11 11:55:39 +08:00

## ADDED Requirements
### Requirement: Table Column Alignment Correction
The system SHALL correct table cell column assignments using header-anchor alignment when PP-Structure outputs incorrect column indices.
#### Scenario: Correct column shift using header anchors
- **WHEN** processing a table with cell_boxes and HTML content
- **THEN** the system SHALL extract the X-coordinate range of each column from the header row (row_idx=0)
- **AND** validate each cell's column assignment against the header X-ranges
- **AND** correct the column index if the cell's X-overlap with its assigned column is < 50%
- **AND** reassign the cell to the column with the highest X-overlap
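
The header-anchor check in this scenario could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Cell` dataclass, its field names, and the `correct_columns` helper are hypothetical, and the sketch assumes that overlap is measured as the fraction of the cell's own width that falls inside a header column's X-range.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    # Hypothetical cell record; field names are assumptions for this sketch.
    row_idx: int
    col_idx: int
    x0: float
    x1: float
    text: str = ""


def x_overlap_ratio(cell, col_range):
    """Fraction of the cell's width that lies inside col_range (left, right)."""
    left, right = col_range
    width = cell.x1 - cell.x0
    if width <= 0:
        return 0.0
    overlap = min(cell.x1, right) - max(cell.x0, left)
    return max(0.0, overlap) / width


def correct_columns(cells, min_overlap=0.5):
    """Reassign body cells whose overlap with their assigned column is < min_overlap.

    Returns a list of (text, original_col, corrected_col) tuples.
    """
    # The header row (row_idx == 0) anchors one X-range per column index.
    header_ranges = {c.col_idx: (c.x0, c.x1) for c in cells if c.row_idx == 0}
    corrections = []
    if not header_ranges:
        return corrections  # no header anchors: leave assignments untouched
    for cell in cells:
        if cell.row_idx == 0:
            continue
        assigned = header_ranges.get(cell.col_idx)
        current = x_overlap_ratio(cell, assigned) if assigned else 0.0
        if current >= min_overlap:
            continue
        best_col = max(
            header_ranges,
            key=lambda col: x_overlap_ratio(cell, header_ranges[col]),
        )
        # Only move the cell if another column is a strictly better fit.
        if x_overlap_ratio(cell, header_ranges[best_col]) > current:
            corrections.append((cell.text, cell.col_idx, best_col))
            cell.col_idx = best_col
    return corrections
```

Measuring overlap relative to the cell width (rather than the column width) keeps narrow cells inside wide columns from being flagged; whether the real system uses the same denominator is not stated in this spec.
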
#### Scenario: Handle tables without headers
- **WHEN** processing a table without a clear header row
- **THEN** the system SHALL skip column correction
- **AND** use original PP-Structure column assignments
- **AND** log that header-anchor correction was skipped
#### Scenario: Log column corrections
- **WHEN** a cell's column index is corrected
- **THEN** the system SHALL log original and corrected column indices
- **AND** include cell content snippet for debugging
- **AND** record total corrections per table
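
One possible shape for this logging, using only the standard `logging` module. The logger name, the `log_corrections` helper, the tuple layout, and the 20-character snippet length are all assumptions made for illustration.

```python
import logging

# Hypothetical logger name for this sketch.
logger = logging.getLogger("table_column_correction")


def log_corrections(table_id, corrections, snippet_len=20):
    """Log each correction and the per-table total.

    corrections: iterable of (cell_text, original_col, corrected_col).
    Returns the number of corrections logged.
    """
    total = 0
    for text, original_col, corrected_col in corrections:
        snippet = text[:snippet_len]  # short content snippet for debugging
        logger.info(
            "table %s: column %d -> %d for cell %r",
            table_id, original_col, corrected_col, snippet,
        )
        total += 1
    logger.info("table %s: %d column correction(s)", table_id, total)
    return total
```
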
### Requirement: Vertical Text Fragment Merging
The system SHALL detect and merge vertically fragmented Chinese text blocks that represent single cells spanning multiple rows.
#### Scenario: Detect vertical text fragments
- **WHEN** processing table text regions
- **THEN** the system SHALL identify narrow text blocks (width/height ratio < 0.3)
- **AND** keep only blocks located in the leftmost 15% of the table area
- **AND** group vertically adjacent blocks with X-center deviation < 10px
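
The detection criteria above might look like this in code. It is a sketch under several assumptions: `TextBlock` and both helper functions are hypothetical, "leftmost 15%" is read as the block's X-center lying within the first 15% of the table width, and `max_gap` (the vertical gap that still counts as "adjacent") is an invented parameter, since the spec gives no value for it.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    # Hypothetical text region; coordinates are in pixels.
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def x_center(self):
        return (self.x0 + self.x1) / 2


def is_vertical_candidate(block, table_x0, table_x1,
                          max_ratio=0.3, left_fraction=0.15):
    """Narrow block (width/height < max_ratio) in the left strip of the table."""
    width = block.x1 - block.x0
    height = block.y1 - block.y0
    if width <= 0 or height <= 0:
        return False
    strip_right = table_x0 + left_fraction * (table_x1 - table_x0)
    return width / height < max_ratio and block.x_center <= strip_right


def group_vertical_fragments(blocks, table_x0, table_x1,
                             max_dx=10.0, max_gap=20.0):
    """Group candidates that stack vertically with similar X-centers.

    max_gap is an assumed threshold, not taken from the spec.
    Only groups with at least two fragments are returned.
    """
    candidates = sorted(
        (b for b in blocks if is_vertical_candidate(b, table_x0, table_x1)),
        key=lambda b: b.y0,
    )
    groups = []
    for block in candidates:
        last = groups[-1][-1] if groups else None
        if (last is not None
                and abs(block.x_center - last.x_center) < max_dx
                and block.y0 - last.y1 <= max_gap):
            groups[-1].append(block)
        else:
            groups.append([block])
    return [g for g in groups if len(g) > 1]
```
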
#### Scenario: Merge fragmented vertical text
- **WHEN** vertical text fragments are detected
- **THEN** the system SHALL merge adjacent fragments into single text blocks
- **AND** combine text content preserving reading order
- **AND** calculate merged bounding box spanning all fragments
- **AND** treat merged block as single cell for column assignment
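
A merge step along these lines could be written as below. The `Fragment` type and `merge_fragments` helper are hypothetical, and two choices are assumptions rather than spec requirements: reading order is taken to be top-to-bottom, and fragment text is concatenated without a separator, as is typical for Chinese.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    # Hypothetical fragment record; bbox is (x0, y0, x1, y1) in pixels.
    bbox: tuple
    text: str


def merge_fragments(fragments):
    """Merge vertically stacked fragments into one block.

    Text is joined top-to-bottom (assumed reading order) and the merged
    bounding box is the smallest box that spans every fragment.
    """
    if not fragments:
        raise ValueError("merge_fragments needs at least one fragment")
    ordered = sorted(fragments, key=lambda f: f.bbox[1])
    merged_text = "".join(f.text for f in ordered)
    merged_bbox = (
        min(f.bbox[0] for f in ordered),
        min(f.bbox[1] for f in ordered),
        max(f.bbox[2] for f in ordered),
        max(f.bbox[3] for f in ordered),
    )
    return Fragment(bbox=merged_bbox, text=merged_text)
```

The merged `Fragment` can then be passed to column assignment as if it were a single detected cell.
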
#### Scenario: Preserve non-vertical text
- **WHEN** text blocks do not meet vertical fragment criteria
- **THEN** the system SHALL preserve original text block boundaries
- **AND** process normally without merging
## MODIFIED Requirements
### Requirement: Extract table structure
The system SHALL extract cell content and boundaries from PP-StructureV3 tables, with post-processing correction for column alignment errors.
#### Scenario: Extract table structure with correction
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply header-anchor column correction when enabled
- **AND** merge vertical text fragments when enabled
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
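
As an illustration of the boundary validation step only, the sketch below separates usable boxes from ones that need the fallback path. It assumes each entry in `cell_boxes` is an `(x0, y0, x1, y1)` tuple in page pixel coordinates; the `split_valid_boxes` helper is hypothetical, and the fallback detection itself is not shown because this spec does not describe it.

```python
def split_valid_boxes(cell_boxes, page_width, page_height):
    """Partition boxes into (valid, invalid) against the page boundaries.

    A box is treated as valid when it has positive area and lies fully
    inside the page; everything else is returned for fallback handling.
    """
    valid, invalid = [], []
    for box in cell_boxes:
        x0, y0, x1, y1 = box
        inside = 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height
        (valid if inside else invalid).append(box)
    return valid, invalid
```
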