OCR/openspec/changes/fix-table-column-alignment/specs/document-processing/spec.md
egg 940a406dce chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00

ADDED Requirements

Requirement: Table Column Alignment Correction

The system SHALL correct table cell column assignments using header-anchor alignment when PP-Structure outputs incorrect column indices.

Scenario: Correct column shift using header anchors

  • WHEN processing a table with cell_boxes and HTML content
  • THEN the system SHALL extract header row (row_idx=0) column X-coordinate ranges
  • AND validate each cell's column assignment against header X-ranges
  • AND correct the column index if the cell's X-overlap with its assigned column is below 50%
  • AND reassign the cell to the column with the highest X-overlap
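
The scenario above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Cell` type, the function names, and the definition of X-overlap as the fraction of the cell's own width that lies inside a header column's X-range are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Cell:
    """Hypothetical cell record; the real cell_boxes layout may differ."""
    row_idx: int
    col_idx: int
    x0: float
    x1: float
    text: str = ""


def x_overlap_ratio(cell: Cell, col_range: Tuple[float, float]) -> float:
    """Fraction of the cell's width inside col_range (assumed definition)."""
    width = cell.x1 - cell.x0
    if width <= 0:
        return 0.0
    overlap = min(cell.x1, col_range[1]) - max(cell.x0, col_range[0])
    return max(0.0, overlap) / width


def header_ranges(cells: List[Cell]) -> Dict[int, Tuple[float, float]]:
    """Map each header column index (row_idx == 0) to its X-range."""
    return {c.col_idx: (c.x0, c.x1) for c in cells if c.row_idx == 0}


def correct_columns(cells: List[Cell], min_overlap: float = 0.5) -> List[Cell]:
    """Reassign body cells whose overlap with their assigned column is too low."""
    ranges = header_ranges(cells)
    if not ranges:
        # No header anchors available: keep the original assignments.
        return list(cells)
    corrected = []
    for cell in cells:
        if cell.row_idx != 0:
            overlaps = {col: x_overlap_ratio(cell, r) for col, r in ranges.items()}
            if overlaps.get(cell.col_idx, 0.0) < min_overlap:
                best = max(overlaps, key=overlaps.get)
                # Only move the cell if it actually overlaps some header column.
                if overlaps[best] > 0.0:
                    cell = replace(cell, col_idx=best)
        corrected.append(cell)
    return corrected
```

Keying the header ranges by column index rather than by list position avoids assuming that header indices are contiguous.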

Scenario: Handle tables without headers

  • WHEN processing a table without a clear header row
  • THEN the system SHALL skip column correction
  • AND use original PP-Structure column assignments
  • AND log that header-anchor correction was skipped

Scenario: Log column corrections

  • WHEN a cell's column index is corrected
  • THEN the system SHALL log original and corrected column indices
  • AND include cell content snippet for debugging
  • AND record total corrections per table
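
One possible shape for this logging, using Python's standard `logging` module, is sketched below. The logger name, the `log_corrections` helper, the `(original_col, corrected_col, text)` tuple layout, and the 20-character snippet length are hypothetical choices for illustration.

```python
import logging
from typing import Iterable, Tuple

# Hypothetical logger name; the project may use a different one.
logger = logging.getLogger("table_column_correction")


def log_corrections(
    table_id: str,
    changes: Iterable[Tuple[int, int, str]],
    snippet_len: int = 20,
) -> int:
    """Log each (original_col, corrected_col, text) change; return the total."""
    total = 0
    for original_col, corrected_col, text in changes:
        if original_col == corrected_col:
            continue  # not a correction, nothing to report
        total += 1
        logger.info(
            "table %s: column %d -> %d, content=%r",
            table_id, original_col, corrected_col, text[:snippet_len],
        )
    # Per-table summary so the number of corrections is recorded once.
    logger.info("table %s: %d column correction(s)", table_id, total)
    return total
```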

Requirement: Vertical Text Fragment Merging

The system SHALL detect and merge vertically fragmented Chinese text blocks that represent single cells spanning multiple rows.

Scenario: Detect vertical text fragments

  • WHEN processing table text regions
  • THEN the system SHALL identify narrow text blocks (width/height ratio < 0.3)
  • AND keep only blocks in the leftmost 15% of the table area
  • AND group vertically adjacent blocks with X-center deviation < 10px
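
A minimal sketch of the detection step follows. Several details are assumptions not stated in the requirement: "leftmost 15%" is read as the block's X-center lying within the first 15% of the table width, and "vertically adjacent" is given a hypothetical `max_y_gap` threshold. `TextBlock` and `find_vertical_groups` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical text region with an axis-aligned bounding box."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def x_center(self) -> float:
        return (self.x0 + self.x1) / 2


def find_vertical_groups(
    blocks: List[TextBlock],
    table_x0: float,
    table_x1: float,
    max_ratio: float = 0.3,      # width/height threshold from the scenario
    left_fraction: float = 0.15,  # leftmost share of the table width
    max_x_dev: float = 10.0,      # X-center deviation in pixels
    max_y_gap: float = 15.0,      # assumed adjacency gap, not in the spec
) -> List[List[TextBlock]]:
    """Group narrow, left-aligned blocks that stack vertically."""
    left_limit = table_x0 + left_fraction * (table_x1 - table_x0)
    candidates = sorted(
        (b for b in blocks
         if b.height > 0
         and b.width / b.height < max_ratio
         and b.x_center <= left_limit),
        key=lambda b: b.y0,
    )
    groups: List[List[TextBlock]] = []
    for block in candidates:
        last = groups[-1][-1] if groups else None
        if (last is not None
                and abs(block.x_center - last.x_center) < max_x_dev
                and block.y0 - last.y1 <= max_y_gap):
            groups[-1].append(block)
        else:
            groups.append([block])
    # A lone block is not a fragment group.
    return [g for g in groups if len(g) > 1]
```

Because candidates are chained top to bottom in a single pass, this sketch assumes at most one vertical text column inside the left strip.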

Scenario: Merge fragmented vertical text

  • WHEN vertical text fragments are detected
  • THEN the system SHALL merge adjacent fragments into single text blocks
  • AND combine text content preserving reading order
  • AND calculate merged bounding box spanning all fragments
  • AND treat merged block as single cell for column assignment
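
The merge itself might look like the sketch below. It assumes that reading order for vertical text is top to bottom and that Chinese fragments are concatenated without a separator; `TextBlock` and `merge_fragments` are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TextBlock:
    """Hypothetical text region with an axis-aligned bounding box."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def merge_fragments(fragments: Sequence[TextBlock]) -> TextBlock:
    """Merge one group of vertical fragments into a single block."""
    if not fragments:
        raise ValueError("cannot merge an empty fragment group")
    # Assumed reading order: top to bottom by the upper edge.
    ordered = sorted(fragments, key=lambda b: b.y0)
    return TextBlock(
        x0=min(b.x0 for b in ordered),
        y0=min(b.y0 for b in ordered),
        x1=max(b.x1 for b in ordered),
        y1=max(b.y1 for b in ordered),
        # No separator between fragments (assumption for Chinese text).
        text="".join(b.text for b in ordered),
    )
```

The merged bounding box spans all fragments, so downstream column assignment can treat the result like any other single cell.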

Scenario: Preserve non-vertical text

  • WHEN text blocks do not meet vertical fragment criteria
  • THEN the system SHALL preserve original text block boundaries
  • AND process normally without merging

MODIFIED Requirements

Requirement: Extract table structure

The system SHALL extract cell content and boundaries from PP-StructureV3 tables, with post-processing correction for column alignment errors.

Scenario: Extract table structure with correction

  • WHEN PP-StructureV3 identifies a table
  • THEN the system SHALL extract cell content and boundaries
  • AND validate cell_boxes coordinates against page boundaries
  • AND apply header-anchor column correction when enabled
  • AND merge vertical text fragments when enabled
  • AND apply fallback detection for invalid coordinates
  • AND preserve table HTML for structure
  • AND extract plain text for translation
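
The coordinate validation step could be sketched as below. The `(x0, y0, x1, y1)` box layout, the helper names, and the rule that a box must have positive area and lie fully inside the page are assumptions; the fallback detection itself is not specified here, so the sketch only separates valid boxes from those that would need it.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in page pixel coordinates.
Box = Tuple[float, float, float, float]


def box_in_page(box: Box, page_width: float, page_height: float) -> bool:
    """True if the box has positive area and lies fully inside the page."""
    x0, y0, x1, y1 = box
    return 0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height


def split_valid_boxes(
    boxes: Sequence[Box], page_width: float, page_height: float
) -> Tuple[List[Box], List[Box]]:
    """Return (valid, invalid); invalid boxes are candidates for fallback."""
    valid: List[Box] = []
    invalid: List[Box] = []
    for box in boxes:
        target = valid if box_in_page(box, page_width, page_height) else invalid
        target.append(box)
    return valid, invalid
```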