OCR/openspec/changes/archive/2025-12-11-fix-table-column-alignment/tasks.md
2025-12-11 17:13:46 +08:00


1. Core Algorithm Implementation

1.1 Table Column Corrector Module

  • 1.1.1 Create table_column_corrector.py service file
  • 1.1.2 Implement ColumnAnchor dataclass for header column ranges
  • 1.1.3 Implement build_column_anchors() to extract header column X-ranges
  • 1.1.4 Implement calculate_x_overlap() utility function
  • 1.1.5 Implement correct_cell_column() for single cell correction
  • 1.1.6 Implement correct_table_columns() main entry point
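
A minimal sketch of how tasks 1.1.2 to 1.1.5 might fit together. Only the class and function names come from the task list; the field names, signatures, the (x0, y0, x1, y1) box layout, and the choice to measure overlap as a fraction of the cell's width are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in page coordinates.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class ColumnAnchor:
    """X-range occupied by one header column (field names are assumptions)."""
    col_index: int
    x_min: float
    x_max: float


def calculate_x_overlap(cell_x0: float, cell_x1: float,
                        anchor_x0: float, anchor_x1: float) -> float:
    """Return the fraction of the cell's width that lies inside the anchor."""
    width = cell_x1 - cell_x0
    if width <= 0:
        return 0.0
    overlap = min(cell_x1, anchor_x1) - max(cell_x0, anchor_x0)
    return max(0.0, overlap) / width


def build_column_anchors(header_boxes: Sequence[BBox]) -> List[ColumnAnchor]:
    """Build one anchor per header cell, ordered left to right."""
    ordered = sorted(header_boxes, key=lambda box: box[0])
    return [ColumnAnchor(i, box[0], box[2]) for i, box in enumerate(ordered)]


def correct_cell_column(cell_box: BBox, current_col: int,
                        anchors: Sequence[ColumnAnchor],
                        threshold: float = 0.5) -> int:
    """Pick the anchor with the largest overlap; keep current_col if too weak."""
    best_col, best_overlap = current_col, 0.0
    for anchor in anchors:
        overlap = calculate_x_overlap(cell_box[0], cell_box[2],
                                      anchor.x_min, anchor.x_max)
        if overlap > best_overlap:
            best_col, best_overlap = anchor.col_index, overlap
    return best_col if best_overlap >= threshold else current_col
```

With headers at X-ranges 0-100, 100-200, and 200-300, a cell spanning 110-190 that was parsed into column 0 would be reassigned to column 1, while a cell outside every anchor keeps its parsed column.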

1.2 HTML Cell Extraction

  • 1.2.1 Implement parse_table_html_with_positions() to extract cells with their row/column indices
  • 1.2.2 Implement cell-to-cellbox matching using IoU
  • 1.2.3 Handle colspan/rowspan in header detection
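
One possible shape for the extraction step, sketched with the standard-library `html.parser`. The returned dictionary keys, the `_CellCollector` helper, and the `iou()` helper name are hypothetical; the task list only names `parse_table_html_with_positions()` and says that matching uses IoU. Cells covered by an earlier colspan or rowspan are skipped when assigning column indices.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Set, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


def _span(value: Optional[str]) -> int:
    """Parse a colspan/rowspan attribute, defaulting to 1 on bad input."""
    try:
        return max(1, int(value))
    except (TypeError, ValueError):
        return 1


class _CellCollector(HTMLParser):
    """Collects <td>/<th> cells with grid positions (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Dict] = []
        self._occupied: Set[Tuple[int, int]] = set()
        self._row, self._col = -1, 0
        self._current: Optional[Dict] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._col = self._row + 1, 0
        elif tag in ("td", "th"):
            self._row = max(self._row, 0)  # tolerate cells outside a <tr>
            attr = dict(attrs)
            colspan, rowspan = _span(attr.get("colspan")), _span(attr.get("rowspan"))
            # Skip grid slots already claimed by spans from earlier cells.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            self._current = {"row": self._row, "col": self._col,
                             "colspan": colspan, "rowspan": rowspan,
                             "is_header": tag == "th", "text": ""}
            for r in range(rowspan):
                for c in range(colspan):
                    self._occupied.add((self._row + r, self._col + c))
            self._col += colspan

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.cells.append(self._current)
            self._current = None


def parse_table_html_with_positions(table_html: str) -> List[Dict]:
    """Return one dict per cell with its row/col position and spans."""
    collector = _CellCollector()
    collector.feed(table_html)
    collector.close()
    return collector.cells


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0
```

Matching a parsed cell to a detected cell box could then pick the box with the highest `iou()` against the cell's expected region; how that expected region is derived is not specified in the task list.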

1.3 Vertical Fragment Merging

  • 1.3.1 Implement detect_vertical_fragments() to find narrow text blocks
  • 1.3.2 Implement should_merge_blocks() adjacency check
  • 1.3.3 Implement merge_vertical_fragments() main function
  • 1.3.4 Integrate merged blocks back into table structure
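
The fragment-merging tasks above could be sketched as follows. This assumes that a "narrow" block is one whose width-to-height ratio falls below the configured aspect ratio, and that fragments are merged when they sit side by side with a small horizontal gap and overlapping vertical extent. The `TextBlock` type, the `max_gap` parameter, the `is_vertical_fragment()` helper, and the left-to-right text order are all hypothetical choices, not details taken from the task list.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)


@dataclass
class TextBlock:
    """Hypothetical OCR text block."""
    text: str
    bbox: BBox


def is_vertical_fragment(block: TextBlock, max_aspect_ratio: float = 0.3) -> bool:
    """A block counts as a fragment when width / height is below the ratio."""
    x0, y0, x1, y1 = block.bbox
    height = y1 - y0
    return height > 0 and (x1 - x0) / height < max_aspect_ratio


def detect_vertical_fragments(blocks: Sequence[TextBlock],
                              max_aspect_ratio: float = 0.3) -> List[TextBlock]:
    """Return the narrow blocks, ordered left to right."""
    narrow = [b for b in blocks if is_vertical_fragment(b, max_aspect_ratio)]
    return sorted(narrow, key=lambda b: b.bbox[0])


def should_merge_blocks(left: TextBlock, right: TextBlock,
                        max_gap: float = 10.0) -> bool:
    """Adjacent means: small horizontal gap and some shared vertical extent."""
    gap = right.bbox[0] - left.bbox[2]
    y_overlap = min(left.bbox[3], right.bbox[3]) - max(left.bbox[1], right.bbox[1])
    return gap <= max_gap and y_overlap > 0


def _union(group: Sequence[TextBlock]) -> TextBlock:
    """Combine a run of fragments into one block covering all of them."""
    return TextBlock(
        text="".join(b.text for b in group),  # left-to-right order is assumed
        bbox=(min(b.bbox[0] for b in group), min(b.bbox[1] for b in group),
              max(b.bbox[2] for b in group), max(b.bbox[3] for b in group)),
    )


def merge_vertical_fragments(blocks: Sequence[TextBlock],
                             max_aspect_ratio: float = 0.3,
                             max_gap: float = 10.0) -> List[TextBlock]:
    """Keep wide blocks as-is and merge runs of adjacent narrow blocks."""
    result = [b for b in blocks if not is_vertical_fragment(b, max_aspect_ratio)]
    group: List[TextBlock] = []
    for block in detect_vertical_fragments(blocks, max_aspect_ratio):
        if group and not should_merge_blocks(group[-1], block, max_gap):
            result.append(_union(group))
            group = []
        group.append(block)
    if group:
        result.append(_union(group))
    return result
```

Documents whose vertical text reads right to left would need the join order in `_union()` reversed.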

2. Configuration

2.1 Settings

  • 2.1.1 Add table_column_correction_enabled: bool = True
  • 2.1.2 Add table_column_correction_threshold: float = 0.5
  • 2.1.3 Add vertical_fragment_merge_enabled: bool = True
  • 2.1.4 Add vertical_fragment_aspect_ratio: float = 0.3
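
As an illustration, the four settings could be grouped as below. The field names, types, and defaults are taken from the list above; the `TableCorrectionSettings` class name, the use of a plain dataclass (the project may use a different settings mechanism), the comments on what each value means, and the range checks are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableCorrectionSettings:
    """Hypothetical container for the table-correction options."""
    table_column_correction_enabled: bool = True
    # Assumed meaning: minimum overlap needed before a cell is moved.
    table_column_correction_threshold: float = 0.5
    vertical_fragment_merge_enabled: bool = True
    # Assumed meaning: blocks with width / height below this are "narrow".
    vertical_fragment_aspect_ratio: float = 0.3

    def __post_init__(self) -> None:
        # Range checks are an added safeguard, not part of the task list.
        if not 0.0 <= self.table_column_correction_threshold <= 1.0:
            raise ValueError("table_column_correction_threshold must be in [0, 1]")
        if self.vertical_fragment_aspect_ratio <= 0:
            raise ValueError("vertical_fragment_aspect_ratio must be positive")
```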

3. Integration

3.1 Pipeline Integration

  • 3.1.1 Add correction step in pdf_generator_service.py before table rendering
  • 3.1.2 Pass corrected HTML to existing table rendering logic
  • 3.1.3 Add diagnostic logging for corrections made

3.2 Error Handling

  • 3.2.1 Handle tables without headers gracefully
  • 3.2.2 Handle empty/malformed cell_boxes
  • 3.2.3 Fall back to original structure on correction failure
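
The integration and error-handling tasks in section 3 suggest a guard around the correction call. A rough sketch follows, in which `apply_column_correction()` and its parameters are hypothetical and the corrector is passed in as a callable so the example stays self-contained; the real call site in pdf_generator_service.py is not shown in this document.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger(__name__)


def apply_column_correction(table_html: str,
                            cell_boxes: Sequence,
                            corrector: Callable[[str, Sequence], str],
                            enabled: bool = True) -> str:
    """Run the corrector, returning the original HTML on any problem."""
    if not enabled or not cell_boxes:
        # Nothing to correct against: empty or missing cell boxes.
        return table_html
    try:
        corrected = corrector(table_html, cell_boxes)
    except Exception:
        # Fall back to the original structure instead of failing the page.
        logger.exception("Table column correction failed; keeping original HTML")
        return table_html
    if not corrected:
        return table_html
    if corrected != table_html:
        logger.info("Table column correction modified the table structure")
    return corrected
```

Catching a broad `Exception` here is a deliberate trade-off for the sketch: a failed correction should degrade to the uncorrected table rather than abort PDF generation.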

4. Testing

4.1 Unit Tests

  • 4.1.1 Test build_column_anchors() with various header configurations
  • 4.1.2 Test correct_cell_column() with known column shift cases
  • 4.1.3 Test merge_vertical_fragments() with vertical text samples
  • 4.1.4 Test edge cases: empty tables, single column, no headers

4.2 Integration Tests

  • 4.2.1 Test with scan.pdf Table 7 (the problematic case)
  • 4.2.2 Test with tables that have correct alignment (no regression)
  • 4.2.3 Visual comparison of corrected vs original output

5. Documentation

  • 5.1 Add inline code comments explaining correction algorithm
  • 5.2 Update spec with new table column correction requirement
  • 5.3 Add logging messages for debugging