OCR/openspec/changes/archive/2025-11-26-fix-ocr-table-empty-columns/specs/ocr-processing/spec.md
egg 59206a6ab8 feat: simplify layout model selection and archive proposals
Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 13:27:00 +08:00

ADDED Requirements

Requirement: OCR Table Empty Column Cleanup

The OCR Track converter SHALL clean up PP-Structure-generated tables by removing every column in which all cells are empty or contain only whitespace.

The system SHALL:

  1. Identify columns where every cell's content is empty or contains only whitespace (using .strip() to determine emptiness)
  2. Remove identified empty columns from the table structure
  3. Update the columns/cols value to reflect the new column count
  4. Recalculate each cell's col index to maintain continuity
  5. Adjust col_span values when spans cross removed columns (shrink span size)
  6. Remove cells entirely when their complete span falls within removed columns
  7. Preserve original bbox and page coordinates (no layout drift)
  8. If columns is 0 or missing after cleanup, set it to the calculated column count

The cleanup SHALL NOT:

  • Remove columns where the header is empty but data rows contain values
  • Modify tables in the Direct or HYBRID tracks
  • Alter the original bbox coordinates
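The cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the table shape (a dict with a "columns" count and a "cells" list whose entries carry "row", "col", "col_span", "content" and "bbox") and the function name remove_empty_columns are assumptions made for the example. It also assumes that a spanning cell's text counts only toward its starting column, which the requirement does not state.

```python
from copy import deepcopy


def remove_empty_columns(table: dict) -> dict:
    """Return a copy of `table` with all-empty columns removed.

    Hypothetical shape: {"columns": int, "cells": [{"row", "col",
    "col_span", "content", "bbox"}, ...]}.
    """
    cells = table.get("cells", [])
    # Column count: take the larger of the metadata and what the cells imply.
    derived = max((c["col"] + c.get("col_span", 1) for c in cells), default=0)
    total = max(table.get("columns") or 0, derived)

    # Assumption: a cell's content is attributed to its starting column only.
    non_empty = {c["col"] for c in cells if str(c.get("content") or "").strip()}

    keep = [i for i in range(total) if i in non_empty]
    new_index = {old: new for new, old in enumerate(keep)}

    out = deepcopy(table)  # leave the caller's table untouched
    out_cells = []
    for c in out.get("cells", []):
        span = c.get("col_span", 1)
        surviving = [i for i in range(c["col"], c["col"] + span) if i in new_index]
        if not surviving:
            continue  # the whole span fell inside removed columns
        c["col"] = new_index[surviving[0]]  # re-index to stay contiguous
        c["col_span"] = len(surviving)      # shrink spans that crossed a removal
        out_cells.append(c)                 # bbox is never modified
    out["cells"] = out_cells
    out["columns"] = len(keep)
    return out
```

Because emptiness is judged per column across all rows, a column with an empty header but at least one non-empty data cell is kept. Restricting the call to OCR Track tables is left to the caller.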

Scenario: All rows in column are empty

  • WHEN a table has a column where all cells contain only empty or whitespace content
  • THEN that column is removed
  • AND each remaining cell to the right of the removed column has its col index decremented by 1
  • AND cols count is reduced by 1

Scenario: Column has empty header but data has values

  • WHEN a table has a column where the header cell is empty
  • AND at least one data row cell in that column contains non-whitespace content
  • THEN that column is NOT removed

Scenario: Cell span crosses removed column

  • WHEN a cell has col_span > 1
  • AND one or more columns within the span are removed
  • THEN the col_span is reduced by the number of removed columns within the span

Scenario: Cell span entirely within removed columns

  • WHEN a cell's entire span falls within columns that are all removed
  • THEN that cell is removed from the table
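The arithmetic behind the two span scenarios can be illustrated in isolation. The helper below is a hypothetical sketch (the name remap_span and its signature are not taken from the codebase): given a cell's original col and col_span and the set of removed column indices, it returns the adjusted pair, or None when the cell should be dropped.

```python
from typing import Optional, Set, Tuple


def remap_span(col: int, col_span: int, removed: Set[int]) -> Optional[Tuple[int, int]]:
    """Map (col, col_span) to post-cleanup values, or None if the cell vanishes."""
    surviving = [i for i in range(col, col + col_span) if i not in removed]
    if not surviving:
        return None  # entire span lies within removed columns
    first = surviving[0]
    # Shift left by the number of removed columns before the first survivor.
    new_col = first - sum(1 for r in removed if r < first)
    return new_col, len(surviving)
```

For example, a cell at col 0 with col_span 3 keeps col 0 and shrinks to col_span 2 when column 1 is removed, while a cell at col 1 with col_span 2 is dropped when columns 1 and 2 are both removed.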

Scenario: Missing columns metadata

  • WHEN the table dict has columns set to 0 or missing
  • AND cleanup is performed
  • THEN columns is set to the calculated number of remaining columns
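One plausible reading of "calculated number of remaining columns" is the rightmost column edge reached by any cell. The sketch below takes that reading; the function name ensure_columns, the "columns" key and the cell fields are assumptions for illustration only.

```python
def ensure_columns(table: dict) -> dict:
    """Fill in a missing or zero "columns" value from the cells themselves."""
    if not table.get("columns"):
        # Rightmost edge reached by any cell: start index plus span.
        table["columns"] = max(
            (c["col"] + c.get("col_span", 1) for c in table.get("cells", [])),
            default=0,  # a table with no cells has zero columns
        )
    return table
```

An existing non-zero value is left as it is, so the fallback only applies in the case the scenario describes.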

Requirement: OCR Table Column Alignment by Bbox

(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.

The system SHALL:

  1. Sort cells by bbox x0 coordinate before assigning column indices
  2. Reassign col indices based on spatial position rather than HTML order

This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.

Scenario: Cells reordered by bbox position

  • WHEN bbox coordinates are available for table cells
  • AND the original HTML order does not match spatial order
  • THEN cells are reordered by x0 coordinate
  • AND col indices are reassigned to reflect spatial positioning
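A rough sketch of the bbox-based reordering follows. It rests on several assumptions not stated in the requirement: bbox is a sequence whose first element is x0, sorting is done within each row rather than across the whole table, and a row is left in HTML order if any of its cells lacks a bbox. The function name realign_columns_by_bbox is hypothetical.

```python
from collections import defaultdict


def realign_columns_by_bbox(cells: list) -> list:
    """Reassign `col` within each row by ascending bbox x0 (mutates the cells)."""
    rows = defaultdict(list)
    for cell in cells:
        rows[cell["row"]].append(cell)

    for row_cells in rows.values():
        if any(not cell.get("bbox") for cell in row_cells):
            continue  # bbox not reliably available: keep the HTML order
        row_cells.sort(key=lambda cell: cell["bbox"][0])
        cursor = 0
        for cell in row_cells:
            cell["col"] = cursor
            cursor += cell.get("col_span", 1)  # leave room for spanning cells

    # Return the cells in reading order: by row, then by new column index.
    return sorted(cells, key=lambda cell: (cell["row"], cell["col"]))
```

Skipping rows with incomplete bbox data keeps the enhancement safe to enable when PP-Structure only reports coordinates for some cells.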