Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
ADDED Requirements
Requirement: OCR Table Empty Column Cleanup
The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.
The system SHALL:
- Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
- Remove identified empty columns from the table structure
- Update the `columns`/`cols` value to reflect the new column count
- Recalculate each cell's `col` index to maintain continuity
- Adjust `col_span` values when spans cross removed columns (shrink span size)
- Remove cells entirely when their complete span falls within removed columns
- Preserve original bbox and page coordinates (no layout drift)
- If `columns` is 0 or missing after cleanup, fill with the calculated column count
The cleanup SHALL NOT:
- Remove columns where the header is empty but data rows contain values
- Modify tables in Direct or HYBRID track
- Alter the original bbox coordinates
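The detection step above can be sketched as follows. This is a minimal illustration, not the converter's actual code: the function name `find_empty_columns`, the `content` and `row` keys, and the flat list-of-dicts cell layout are assumptions made for the example. It also assumes a cell's text counts only toward its starting `col`, so a wide spanning cell does not keep the other columns it covers from being removed.

```python
from typing import Any


def find_empty_columns(cells: list[dict[str, Any]], n_cols: int) -> set[int]:
    """Return indices of columns in which no cell has non-whitespace content.

    Hypothetical helper: `content` and the cell dict layout are assumed.
    A cell's text is attributed to its starting `col` only.
    """
    non_empty: set[int] = set()
    for cell in cells:
        text = str(cell.get("content") or "")
        if text.strip():  # same emptiness rule as the requirement
            non_empty.add(cell.get("col", 0))
    return {c for c in range(n_cols) if c not in non_empty}


cells = [
    {"row": 0, "col": 0, "content": "Name"},
    {"row": 0, "col": 1, "content": ""},    # empty header, but data below
    {"row": 0, "col": 2, "content": "  "},  # whitespace only
    {"row": 1, "col": 0, "content": "Widget"},
    {"row": 1, "col": 1, "content": "42"},
    {"row": 1, "col": 2, "content": ""},
]
empty = find_empty_columns(cells, n_cols=3)
print(empty)  # {2}
```

Column 1 survives because a data row holds a value even though its header is empty, and no bbox field is read or written at any point.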
Scenario: All rows in column are empty
- WHEN a table has a column where all cells contain only empty or whitespace content
- THEN that column is removed
- AND remaining cells have their `col` indices decremented appropriately
- AND `cols` count is reduced by 1
Scenario: Column has empty header but data has values
- WHEN a table has a column where the header cell is empty
- AND at least one data row cell in that column contains non-whitespace content
- THEN that column is NOT removed
Scenario: Cell span crosses removed column
- WHEN a cell has `col_span > 1`
- AND one or more columns within the span are removed
- THEN the `col_span` is reduced by the number of removed columns within the span
Scenario: Cell span entirely within removed columns
- WHEN a cell's entire span falls within columns that are all removed
- THEN that cell is removed from the table
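The two span scenarios reduce to one small calculation per cell. The sketch below assumes the set of removed column indices is already known; `remap_cell_span` is a hypothetical name, not something the spec defines.

```python
from typing import Optional


def remap_cell_span(
    col: int, col_span: int, removed: set[int]
) -> Optional[tuple[int, int]]:
    """Return (new_col, new_col_span) after `removed` columns are dropped.

    Returns None when every column the cell covers was removed, meaning
    the cell itself should be dropped. Hypothetical helper for illustration.
    """
    covered = range(col, col + max(1, col_span))
    kept = [c for c in covered if c not in removed]
    if not kept:
        return None
    # New index = number of surviving columns left of the first kept column.
    new_col = sum(1 for c in range(kept[0]) if c not in removed)
    return new_col, len(kept)


print(remap_cell_span(0, 3, {1}))     # (0, 2): span shrinks by one
print(remap_cell_span(2, 1, {1}))     # (1, 1): index shifts left
print(remap_cell_span(1, 2, {1, 2}))  # None: whole span removed
```

Counting the surviving columns, rather than subtracting a fixed amount, keeps the result correct when the removed columns inside a span are not adjacent.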
Scenario: Missing columns metadata
- WHEN the table dict has `columns` set to 0 or missing
- AFTER cleanup is performed
- THEN `columns` is set to the calculated number of remaining columns
Requirement: OCR Table Column Alignment by Bbox
(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.
The system SHALL:
- Sort cells by bbox `x0` coordinate before assigning column indices
- Reassign `col` indices based on spatial position rather than HTML order
This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
Scenario: Cells reordered by bbox position
- WHEN bbox coordinates are available for table cells
- AND the original HTML order does not match spatial order
- THEN cells are reordered by `x0` coordinate
- AND `col` indices are reassigned to reflect spatial positioning
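One possible reading of this optional requirement is a per-row sort, sketched below. Several things here are assumptions rather than part of the spec: the `bbox` field is taken to be `[x0, y0, x1, y1]`, cells are taken to carry a `row` key, and rows where any cell lacks a bbox are left in their original order. `reassign_cols_by_x0` is a hypothetical name.

```python
from collections import defaultdict
from typing import Any


def reassign_cols_by_x0(cells: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of `cells` with `col` reassigned left to right by bbox x0.

    Assumes bbox = [x0, y0, x1, y1]. Rows with any missing bbox are copied
    unchanged. Bbox values themselves are never modified.
    """
    rows: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for cell in cells:
        rows[cell["row"]].append(cell)

    result: list[dict[str, Any]] = []
    for row_idx in sorted(rows):
        row_cells = rows[row_idx]
        if not all(c.get("bbox") for c in row_cells):
            result.extend(dict(c) for c in row_cells)  # no reliable bbox data
            continue
        next_col = 0
        for cell in sorted(row_cells, key=lambda c: c["bbox"][0]):
            result.append(dict(cell, col=next_col))
            next_col += max(1, cell.get("col_span", 1))  # leave room for spans
    return result


html_order = [
    {"row": 0, "col": 0, "content": "B", "bbox": [120, 10, 200, 30]},
    {"row": 0, "col": 1, "content": "A", "bbox": [10, 10, 110, 30]},
]
for cell in reassign_cols_by_x0(html_order):
    print(cell["content"], cell["col"])
# A 0
# B 1
```

Ranking within each row is the simplest approach. It does not line up columns across rows with differing cell counts, which would need clustering of `x0` values over the whole table.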