feat: simplify layout model selection and archive proposals

Changes:
- Replace PP-Structure 7-slider parameter UI with simple 3-option layout model selector
- Add layout model mapping: chinese (PP-DocLayout-S), default (PubLayNet), cdla
- Add LayoutModelSelector component and zh-TW translations
- Fix "default" model behavior with sentinel value for PubLayNet
- Add gap filling service for OCR track coverage improvement
- Add PP-Structure debug utilities
- Archive completed/incomplete proposals:
  - add-ocr-track-gap-filling (complete)
  - fix-ocr-track-table-rendering (incomplete)
  - simplify-ppstructure-model-selection (22/25 tasks)
- Add new layout model tests, archive old PP-Structure param tests
- Update OpenSpec ocr-processing spec with layout model requirements

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
egg
2025-11-27 13:27:00 +08:00
parent c65df754cf
commit 59206a6ab8
35 changed files with 3621 additions and 658 deletions

@@ -0,0 +1,61 @@
## ADDED Requirements
### Requirement: OCR Table Empty Column Cleanup
The OCR Track converter SHALL clean up PP-Structure-generated tables by removing every column in which all cells are empty or contain only whitespace.
The system SHALL:
1. Identify columns where every cell's content is empty or contains only whitespace (using `.strip()` to determine emptiness)
2. Remove identified empty columns from the table structure
3. Update the `columns`/`cols` value to reflect the new column count
4. Recalculate each cell's `col` index to maintain continuity
5. Adjust `col_span` values when spans cross removed columns (shrink span size)
6. Remove cells entirely when their complete span falls within removed columns
7. Preserve original bbox and page coordinates (no layout drift)
8. If `columns` is 0 or missing after cleanup, set it to the calculated column count
The cleanup SHALL NOT:
- Remove columns where the header is empty but data rows contain values
- Modify tables in the Direct or HYBRID tracks
- Alter the original bbox coordinates
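
Step 1 of the cleanup can be sketched as follows. This is a minimal illustration, not the converter's actual code: the cell layout (a list of dicts with a `content` key) and the function name `find_empty_columns` are assumptions, and treating a non-empty spanning cell as occupying every column it covers is one possible interpretation of the requirement.

```python
def find_empty_columns(cells, num_cols):
    """Return the set of column indices in which every cell is empty.

    Hypothetical cell shape: {"row": int, "col": int, "col_span": int, "content": str}.
    """
    non_empty = set()
    for cell in cells:
        text = cell.get("content") or ""
        if text.strip():  # emptiness is decided with .strip(), as the spec requires
            start = cell.get("col", 0)
            span = cell.get("col_span") or 1
            # Assumption: a non-empty spanning cell protects all columns it covers.
            non_empty.update(range(start, start + span))
    return set(range(num_cols)) - non_empty
```

Because header cells and data cells are treated alike, a column with an empty header but at least one populated data cell is never reported as empty.
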
#### Scenario: All rows in column are empty
- **WHEN** a table has a column where all cells contain only empty or whitespace content
- **THEN** that column is removed
- **AND** remaining cells have their `col` indices decremented appropriately
- **AND** the `columns`/`cols` count is reduced by 1
#### Scenario: Column has empty header but data has values
- **WHEN** a table has a column where the header cell is empty
- **AND** at least one data row cell in that column contains non-whitespace content
- **THEN** that column is NOT removed
#### Scenario: Cell span crosses removed column
- **WHEN** a cell has `col_span > 1`
- **AND** one or more columns within the span are removed
- **THEN** the `col_span` is reduced by the number of removed columns within the span
#### Scenario: Cell span entirely within removed columns
- **WHEN** a cell's entire span falls within columns that are all removed
- **THEN** that cell is removed from the table
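
The two span scenarios above, together with the `col` re-indexing rule, can be illustrated with a short sketch. The function name `remap_cells` and the cell dict layout are assumptions made for illustration; only `col`, `col_span`, and the bbox are named by the requirement.

```python
def remap_cells(cells, removed):
    """Re-index cells after the columns in `removed` have been dropped.

    Returns new cell dicts; the input cells and their bbox values are not modified.
    """
    removed = set(removed)
    result = []
    for cell in cells:
        start = cell.get("col", 0)
        span = cell.get("col_span") or 1
        # Columns of this cell that survive the cleanup.
        kept = [c for c in range(start, start + span) if c not in removed]
        if not kept:
            continue  # the whole span fell inside removed columns: drop the cell
        new_cell = dict(cell)  # shallow copy, bbox carried over unchanged
        # Shift left by the number of removed columns before the first kept one.
        new_cell["col"] = kept[0] - sum(1 for r in removed if r < kept[0])
        new_cell["col_span"] = len(kept)  # span shrinks by the removed columns inside it
        result.append(new_cell)
    return result
```

Copying each cell rather than mutating it keeps the original bbox and page coordinates intact, which matches the no-layout-drift rule.
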
#### Scenario: Missing columns metadata
- **WHEN** the table dict has `columns` set to 0 or missing after cleanup is performed
- **THEN** `columns` is set to the calculated number of remaining columns
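
A possible way to fill the missing count is to derive it from the remaining cells. The sketch below assumes the table dict stores its cells under a `cells` key and that each cell carries `col` and `col_span`; the helper name `ensure_column_count` is hypothetical.

```python
def ensure_column_count(table):
    """Set table["columns"] from the cells when it is 0 or missing."""
    cells = table.get("cells", [])
    # Rightmost column edge over all cells; 0 for a table without cells.
    calculated = max(
        (cell.get("col", 0) + (cell.get("col_span") or 1) for cell in cells),
        default=0,
    )
    if not table.get("columns"):  # covers both a missing key and a value of 0
        table["columns"] = calculated
    return table
```

An existing non-zero `columns` value is left untouched, so the helper only acts in the situation this scenario describes.
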
### Requirement: OCR Table Column Alignment by Bbox
(Optional Enhancement) When bbox coordinates are available for table cells, the OCR Track converter SHALL use each cell's bbox `x0` coordinate to improve column alignment accuracy.
The system SHALL:
1. Sort cells by bbox `x0` coordinate before assigning column indices
2. Reassign `col` indices based on spatial position rather than HTML order
This requirement is optional and implementation MAY be deferred if bbox data is not reliably available.
#### Scenario: Cells reordered by bbox position
- **WHEN** bbox coordinates are available for table cells
- **AND** the original HTML order does not match spatial order
- **THEN** cells are reordered by `x0` coordinate
- **AND** `col` indices are reassigned to reflect spatial positioning
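
As a rough sketch of the reordering, cells can be grouped by row, sorted by `x0`, and numbered left to right. The bbox layout `[x0, y0, x1, y1]`, the `row` key, and the function name `reassign_cols_by_x0` are assumptions; the sketch also ignores row spans and missing cells, which a real implementation would need to handle.

```python
from collections import defaultdict


def reassign_cols_by_x0(cells):
    """Return copies of the cells with `col` assigned by horizontal position."""
    rows = defaultdict(list)
    for cell in cells:
        rows[cell["row"]].append(cell)

    result = []
    for row in sorted(rows):
        next_col = 0
        # Assumed bbox layout: [x0, y0, x1, y1], so index 0 is the left edge.
        for cell in sorted(rows[row], key=lambda c: c["bbox"][0]):
            new_cell = dict(cell)  # bbox itself is left unchanged
            new_cell["col"] = next_col
            next_col += new_cell.get("col_span") or 1
            result.append(new_cell)
    return result
```
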