feat: simplify layout model selection and archive proposals
Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@@ -0,0 +1,61 @@
## ADDED Requirements

### Requirement: OCR Table Empty Column Cleanup

The OCR Track converter SHALL clean up PP-Structure generated tables by removing columns where all rows have empty or whitespace-only content.

The system SHALL:
1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
2. Remove identified empty columns from the table structure
3. Update the `columns`/`cols` value to reflect the new column count
4. Recalculate each cell's `col` index to maintain continuity
5. Adjust `col_span` values when spans cross removed columns (shrink span size)
6. Remove cells entirely when their complete span falls within removed columns
7. Preserve original bbox and page coordinates (no layout drift)
8. If `columns` is 0 or missing after cleanup, fill with the calculated column count

The cleanup SHALL NOT:
- Remove columns where the header is empty but data rows contain values
- Modify tables in Direct or HYBRID track
- Alter the original bbox coordinates
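
The rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the converter's actual implementation: the table is assumed to be a dict with a `cells` list whose entries carry `row`, `col`, `col_span`, `content`, and `bbox`, plus a `cols` count, and the function name `remove_empty_columns` is hypothetical. The sketch also assumes that a spanning cell's content counts only toward its first (anchor) column.

```python
from copy import deepcopy
from typing import Any


def remove_empty_columns(table: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `table` without columns whose cells are all blank.

    Hypothetical shape: table["cells"] is a list of dicts with "col",
    optional "col_span", "content", and "bbox"; table["cols"] is the count.
    """
    cells = table.get("cells", [])
    # Use the larger of the stored count and the count implied by the cells,
    # so a missing or stale `cols` value does not hide real columns.
    inferred = max((c["col"] + c.get("col_span", 1) for c in cells), default=0)
    n_cols = max(table.get("cols") or 0, inferred)

    # Step 1: a column is non-empty if any cell anchored in it (header or
    # data row) has content that survives .strip().
    non_empty = {c["col"] for c in cells if str(c.get("content") or "").strip()}

    # Steps 2 and 4: map each surviving old column index to its new index.
    new_index: dict[int, int] = {}
    for old in range(n_cols):
        if old in non_empty:
            new_index[old] = len(new_index)

    new_cells = []
    for cell in cells:
        covered = range(cell["col"], cell["col"] + cell.get("col_span", 1))
        kept = [i for i in covered if i in new_index]
        if not kept:
            continue  # Step 6: the whole span was removed.
        updated = deepcopy(cell)  # Step 7: bbox is copied, never recomputed.
        updated["col"] = new_index[kept[0]]
        updated["col_span"] = len(kept)  # Step 5: shrink the span.
        new_cells.append(updated)

    result = dict(table)
    result["cells"] = new_cells
    result["cols"] = len(new_index)  # Steps 3 and 8.
    return result
```

Because a column survives whenever any of its anchored cells has content, a column with an empty header and populated data rows is kept, and a cell with content is never dropped. Restricting the call to OCR Track tables is left to the caller.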

#### Scenario: All rows in column are empty
- **WHEN** a table has a column where all cells contain only empty or whitespace content
- **THEN** that column is removed
- **AND** remaining cells have their `col` indices decremented appropriately
- **AND** `cols` count is reduced by 1

#### Scenario: Column has empty header but data has values
- **WHEN** a table has a column where the header cell is empty
- **AND** at least one data row cell in that column contains non-whitespace content
- **THEN** that column is NOT removed

#### Scenario: Cell span crosses removed column
- **WHEN** a cell has `col_span > 1`
- **AND** one or more columns within the span are removed
- **THEN** the `col_span` is reduced by the number of removed columns within the span

#### Scenario: Cell span entirely within removed columns
- **WHEN** a cell's entire span falls within columns that are all removed
- **THEN** that cell is removed from the table

#### Scenario: Missing columns metadata
- **WHEN** the table dict has `columns` set to 0 or missing
- **AND** cleanup is performed
- **THEN** `columns` is set to the calculated number of remaining columns
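
One way to compute the fallback count is to take the furthest column any cell reaches. The snippet below is a sketch under assumptions: the `cells` list, its `col` and `col_span` keys, and the helper name `ensure_column_count` are illustrative and not taken from the converter.

```python
from typing import Any


def ensure_column_count(table: dict[str, Any]) -> dict[str, Any]:
    """Fill `columns` from the cells when it is missing or 0 (hypothetical helper)."""
    cells = table.get("cells", [])
    # The calculated count is the rightmost column edge reached by any cell.
    calculated = max((c["col"] + c.get("col_span", 1) for c in cells), default=0)
    fixed = dict(table)
    if not fixed.get("columns"):  # Covers both a missing key and a value of 0.
        fixed["columns"] = calculated
    return fixed
```

A positive `columns` value is left untouched, so the helper only fills gaps and never overrides existing metadata.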

### Requirement: OCR Table Column Alignment by Bbox

(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use cell bbox x0 coordinates to improve column alignment accuracy.

The system SHALL:
1. Sort cells by bbox `x0` coordinate before assigning column indices
2. Reassign `col` indices based on spatial position rather than HTML order

This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.

#### Scenario: Cells reordered by bbox position
- **WHEN** bbox coordinates are available for table cells
- **AND** the original HTML order does not match spatial order
- **THEN** cells are reordered by `x0` coordinate
- **AND** `col` indices are reassigned to reflect spatial positioning
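
A minimal sketch of the reordering, assuming each cell dict carries `row`, `col`, an optional `col_span`, and a `bbox` laid out as `[x0, y0, x1, y1]`. The function name `reassign_cols_by_bbox` and the per-row grouping are assumptions made for illustration, and row spans are ignored.

```python
from collections import defaultdict
from typing import Any


def reassign_cols_by_bbox(cells: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Reassign `col` per row from left to right by bbox x0 (hypothetical helper)."""
    rows: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for cell in cells:
        rows[cell["row"]].append(cell)

    result = []
    for row in sorted(rows):
        row_cells = rows[row]
        # Leave a row in its original order if any cell in it lacks a bbox.
        if not all(c.get("bbox") for c in row_cells):
            result.extend(dict(c) for c in row_cells)
            continue
        next_col = 0
        for cell in sorted(row_cells, key=lambda c: c["bbox"][0]):
            updated = dict(cell)  # Copy so the input cells are not mutated.
            updated["col"] = next_col
            next_col += cell.get("col_span", 1)  # A span occupies several columns.
            result.append(updated)
    return result
```

Rows with incomplete bbox data keep their HTML order, which matches the note that this enhancement may be deferred when bbox data is unreliable.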