chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -67,7 +67,7 @@ The system SHALL use a standardized UnifiedDocument model as the common output f
- **AND** support identical downstream operations (PDF generation, translation)

### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
@@ -84,9 +84,17 @@ The system SHALL utilize the full capabilities of PP-StructureV3, extracting all
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
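
The coordinate check in this scenario could be sketched as below. This is a minimal illustration, not the project's implementation: the `(x0, y0, x1, y1)` box layout, the function names `is_valid_box` and `split_cell_boxes`, and the `tolerance` parameter are all assumptions, and the spec does not say what the fallback detection does, so the sketch only separates out the boxes that would need it.

```python
from typing import List, Sequence, Tuple

# Assumed layout: (x0, y0, x1, y1) in page coordinates, origin at top-left.
Box = Tuple[float, ...]


def is_valid_box(box: Sequence[float], page_width: float, page_height: float,
                 tolerance: float = 2.0) -> bool:
    """Return True if the box is well-formed and lies on the page.

    `tolerance` (hypothetical) allows small overshoots caused by rounding.
    """
    if len(box) != 4:
        return False
    x0, y0, x1, y1 = box
    if x1 <= x0 or y1 <= y0:  # zero or negative width/height
        return False
    return (x0 >= -tolerance and y0 >= -tolerance
            and x1 <= page_width + tolerance
            and y1 <= page_height + tolerance)


def split_cell_boxes(cell_boxes: Sequence[Sequence[float]],
                     page_width: float,
                     page_height: float) -> Tuple[List[Box], List[Box]]:
    """Partition boxes into (valid, invalid); invalid ones go to the fallback."""
    valid: List[Box] = []
    invalid: List[Box] = []
    for box in cell_boxes:
        target = valid if is_valid_box(box, page_width, page_height) else invalid
        target.append(tuple(box))
    return valid, invalid


boxes = [(10, 10, 100, 50), (90, 40, 700, 80), (50, 60, 40, 90)]
valid, invalid = split_cell_boxes(boxes, page_width=595, page_height=842)
```

Here the second box runs past the right page edge and the third has `x1 < x0`, so both land in `invalid`. A caller would hand those to whatever fallback detection the system provides.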

#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF

### Requirement: Structure-Preserving Translation Foundation
The system SHALL maintain document structure and layout information to support future translation features.
@@ -108,3 +116,26 @@ The system SHALL maintain document structure and layout information to support f
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)

### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.

#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges
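
One way to derive rowspan/colspan from cell bounding boxes is sketched below. It is an assumption-laden illustration rather than the project's code: `compute_spans`, its helpers, and the `eps` tolerance are hypothetical names, and the approach (counting how many row and column grid lines a cell covers) is just one plausible technique. With PyMuPDF, the input would presumably be `table.cells` for each table in `page.find_tables().tables`, where a `None` entry stands for a placeholder cell.

```python
from typing import Dict, List, Optional, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def _starts(values: Sequence[float], eps: float) -> List[float]:
    """Collapse nearly equal coordinates into sorted, distinct grid lines."""
    out: List[float] = []
    for v in sorted(values):
        if not out or v - out[-1] > eps:
            out.append(v)
    return out


def _span(starts: Sequence[float], lo: float, hi: float,
          eps: float) -> Tuple[int, int]:
    """Return (index of the first grid line at `lo`, grid lines in [lo, hi))."""
    covered = [i for i, s in enumerate(starts) if lo - eps <= s < hi - eps]
    return covered[0], len(covered)


def compute_spans(cells: Sequence[Optional[BBox]],
                  eps: float = 1.0) -> List[Dict[str, object]]:
    """Assign row/col indices and spans to each real cell.

    Assumes every cell is wider and taller than `eps`.
    """
    real = [c for c in cells if c is not None]  # skip placeholder cells
    col_starts = _starts([c[0] for c in real], eps)
    row_starts = _starts([c[1] for c in real], eps)
    merged = []
    for x0, y0, x1, y1 in real:
        col, colspan = _span(col_starts, x0, x1, eps)
        row, rowspan = _span(row_starts, y0, y1, eps)
        merged.append({"row": row, "col": col, "rowspan": rowspan,
                       "colspan": colspan, "bbox": (x0, y0, x1, y1)})
    return sorted(merged, key=lambda m: (m["row"], m["col"]))


# A 2x2 grid whose top row is a single merged cell; None is the placeholder.
cells = [(0, 0, 100, 20), None, (0, 20, 50, 40), (50, 20, 100, 40)]
spans = compute_spans(cells)
```

The merged header cell comes out with `colspan` 2 and the placeholder is dropped, which matches the "skip placeholder cells covered by merges" step. Cell text would still need to be paired with these entries separately, for example from `table.extract()`.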

#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images
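
The filtering rules in this scenario might look like the sketch below. The `ImageRegion` type, the `is_covering` flag, and the `min_area` default are hypothetical: the spec names neither the threshold value nor how covering/redaction images are detected, so the flag is assumed to be set by an earlier step.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ImageRegion:
    """Hypothetical record for one image found on a page."""
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    is_covering: bool = False  # assumed to be set by a separate detection step

    @property
    def area(self) -> float:
        x0, y0, x1, y1 = self.bbox
        return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def filter_meaningful_images(images: Iterable[ImageRegion],
                             min_area: float = 200.0) -> List[ImageRegion]:
    """Drop images below `min_area` and images flagged as covering."""
    return [img for img in images
            if img.area >= min_area and not img.is_covering]


images = [
    ImageRegion((0, 0, 5, 5)),                        # tiny decoration
    ImageRegion((50, 50, 150, 130)),                  # ordinary content image
    ImageRegion((0, 300, 200, 320), is_covering=True),  # redaction bar
]
kept = filter_meaningful_images(images)
```

Only the second image survives: the first falls under the area threshold and the third is flagged as covering.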

#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output