chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
egg
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions

View File

@@ -67,7 +67,7 @@ The system SHALL use a standardized UnifiedDocument model as the common output f
- **AND** support identical downstream operations (PDF generation, translation)
### Requirement: Enhanced OCR with Full PP-StructureV3
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.
#### Scenario: Extract comprehensive document structure
- **WHEN** processing through OCR track
@@ -84,9 +84,17 @@ The system SHALL utilize the full capabilities of PP-StructureV3, extracting all
#### Scenario: Extract table structure
- **WHEN** PP-StructureV3 identifies a table
- **THEN** the system SHALL extract cell content and boundaries
- **AND** validate cell_boxes coordinates against page boundaries
- **AND** apply fallback detection for invalid coordinates
- **AND** preserve table HTML for structure
- **AND** extract plain text for translation
#### Scenario: Extract visual elements with paths
- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
- **THEN** the system SHALL preserve saved_path for each element
- **AND** include image dimensions and format
- **AND** enable image embedding in output PDF
### Requirement: Structure-Preserving Translation Foundation
The system SHALL maintain document structure and layout information to support future translation features.
@@ -108,3 +116,26 @@ The system SHALL maintain document structure and layout information to support f
- **AND** calculate maximum text expansion ratios
- **AND** preserve non-translatable elements (logos, signatures)
### Requirement: Generate UnifiedDocument from direct extraction
The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
#### Scenario: Extract tables with cell merging
- **WHEN** direct extraction encounters a table
- **THEN** the system SHALL use PyMuPDF find_tables() API
- **AND** extract cell content with correct rowspan/colspan
- **AND** preserve merged cell boundaries
- **AND** skip placeholder cells covered by merges
#### Scenario: Filter decoration images
- **WHEN** extracting images from PDF
- **THEN** the system SHALL filter images smaller than minimum area threshold
- **AND** exclude covering/redaction images
- **AND** preserve meaningful content images
#### Scenario: Preserve text styling with image handling
- **WHEN** direct extraction completes
- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
- **AND** preserve text styling, fonts, and exact positioning
- **AND** extract tables with cell boundaries, content, and merge info
- **AND** include only meaningful images in output