feat: enable document orientation detection for scanned PDFs
- Enable PP-StructureV3's use_doc_orientation_classify feature
- Detect rotation angle from doc_preprocessor_res.angle
- Swap page dimensions (width <-> height) for 90°/270° rotations
- Output PDF now correctly displays landscape-scanned content

Also includes:
- Archive completed openspec proposals
- Add simplify-frontend-ocr-config proposal (pending)
- Code cleanup and frontend simplification

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,36 @@
# document-processing Specification Delta
## MODIFIED Requirements
### Requirement: Extract table structure (Modified)
The system SHALL use cell_boxes coordinates as the primary source for table structure when rendering PDFs, with HTML parsing as the fallback.
#### Scenario: Render table using cell_boxes grid
- **WHEN** rendering a table element to PDF
- **AND** the table has valid cell_boxes coordinates
- **AND** `table_rendering_prefer_cellboxes` is enabled
- **THEN** the system SHALL infer row/column grid from cell_boxes coordinates
- **AND** extract text content from HTML in reading order
- **AND** map content to grid cells by position
- **AND** render table borders using cell_boxes coordinates
- **AND** place text content within calculated cell boundaries
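
The grid inference step in this scenario can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each entry in cell_boxes is an `(x0, y0, x1, y1)` tuple in page coordinates, and the names `cluster_edges`, `nearest_index`, `infer_grid` and the `tol` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x0, y0, x1, y1). The real cell_boxes layout may differ.
Box = Tuple[float, float, float, float]


def cluster_edges(values: Sequence[float], tol: float = 5.0) -> List[float]:
    """Group sorted coordinates whose gap to the previous one is <= tol.

    Each group is collapsed to its mean, giving one grid line per group.
    """
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest_index(lines: Sequence[float], value: float) -> int:
    """Index of the grid line closest to `value`."""
    return min(range(len(lines)), key=lambda i: abs(lines[i] - value))


def infer_grid(cell_boxes: Sequence[Box], tol: float = 5.0):
    """Return (row_lines, col_lines, placements).

    Each placement is (row, col, rowspan, colspan) for the matching box.
    Raises ValueError on empty input so a caller can fall back to HTML.
    """
    if not cell_boxes:
        raise ValueError("cell_boxes is empty")
    col_lines = cluster_edges([x for b in cell_boxes for x in (b[0], b[2])], tol)
    row_lines = cluster_edges([y for b in cell_boxes for y in (b[1], b[3])], tol)
    placements = []
    for x0, y0, x1, y1 in cell_boxes:
        c0, c1 = nearest_index(col_lines, x0), nearest_index(col_lines, x1)
        r0, r1 = nearest_index(row_lines, y0), nearest_index(row_lines, y1)
        # A box spanning several grid lines is treated as a merged cell.
        placements.append((r0, c0, max(1, r1 - r0), max(1, c1 - c0)))
    return row_lines, col_lines, placements
```

With N row lines and M column lines, the inferred grid has N-1 rows and M-1 columns. The tolerance absorbs small jitter between detected boxes; a suitable value depends on page resolution and is left as an assumption here.
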
#### Scenario: Handle cell_boxes grid mismatch gracefully
- **WHEN** cell_boxes grid has different dimensions than HTML colspan/rowspan structure
- **THEN** the system SHALL use cell_boxes grid as authoritative structure
- **AND** map available HTML content to cells row-by-row
- **AND** leave unmapped cells empty
- **AND** log a warning if the number of HTML content items differs significantly from the number of grid cells
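
One way to read the mismatch handling above is sketched below, using only the standard library. The names `CellTextExtractor` and `map_content_to_grid` are hypothetical, and the `warn_ratio` threshold is an assumption: the requirement does not define what counts as "significantly". The sketch fills cells in row-major order and ignores merged cells.

```python
import logging
from html.parser import HTMLParser
from typing import List

logger = logging.getLogger(__name__)


class CellTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th> in document (reading) order."""

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._buffer: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            self.cells.append("".join(self._buffer).strip())
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def map_content_to_grid(
    html: str, n_rows: int, n_cols: int, warn_ratio: float = 0.2
) -> List[List[str]]:
    """Fill an n_rows x n_cols grid row-by-row with HTML cell text.

    Cells without content stay empty; surplus content is dropped.
    warn_ratio is an assumed threshold, not taken from the spec.
    """
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    texts = parser.cells
    total = n_rows * n_cols
    if total and abs(len(texts) - total) / total > warn_ratio:
        logger.warning(
            "cell_boxes grid has %d cells but HTML yielded %d items",
            total,
            len(texts),
        )
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for i, text in enumerate(texts[:total]):
        grid[i // n_cols][i % n_cols] = text
    return grid
```

For example, a 2x2 grid paired with HTML that contains only three cells yields a grid whose last cell is an empty string, and a warning is logged because the counts differ by more than the assumed threshold.
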
#### Scenario: Fall back to HTML-based rendering
- **WHEN** cell_boxes is empty or None
- **OR** `table_rendering_prefer_cellboxes` is disabled
- **OR** cell_boxes grid inference fails
- **THEN** the system SHALL fall back to existing HTML-based table rendering
- **AND** use ReportLab Table with parsed HTML structure
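
The three fallback conditions can be expressed as a single decision function. The sketch below is illustrative only: `choose_table_renderer` is a hypothetical name, `prefer_cellboxes` stands in for the `table_rendering_prefer_cellboxes` setting, and the grid inference routine is passed in as a callable so that no particular implementation is assumed. The HTML path itself (ReportLab Table) is not shown.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def choose_table_renderer(
    cell_boxes: Optional[Sequence[Any]],
    prefer_cellboxes: bool,
    infer_grid: Callable[[Sequence[Any]], Any],
) -> Tuple[str, Any]:
    """Return ("cell_boxes", grid) or ("html", None).

    "html" means the caller should use the existing HTML-based rendering.
    """
    # Condition: the setting is disabled.
    if not prefer_cellboxes:
        return "html", None
    # Condition: cell_boxes is empty or None.
    if not cell_boxes:
        return "html", None
    # Condition: grid inference fails. Catching Exception broadly is an
    # assumption of this sketch; real code may catch narrower error types.
    try:
        grid = infer_grid(cell_boxes)
    except Exception:
        return "html", None
    if grid is None:
        return "html", None
    return "cell_boxes", grid
```

Keeping the decision separate from both renderers makes each fallback condition easy to unit-test without producing a PDF.
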
#### Scenario: Maintain backward compatibility
- **WHEN** processing tables where cell_boxes grid matches HTML structure
- **THEN** the system SHALL produce identical output to previous behavior
- **AND** pass all existing table rendering tests