chore: backup before code cleanup

Backup commit before executing the remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/specs/document-processing/spec.md
+++ b/openspec/specs/document-processing/spec.md
@@ -67,7 +67,7 @@ The system SHALL use a standardized UnifiedDocument model as the common output f
 - **AND** support identical downstream operations (PDF generation, translation)

 ### Requirement: Enhanced OCR with Full PP-StructureV3
-The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
+The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

 #### Scenario: Extract comprehensive document structure
 - **WHEN** processing through OCR track
@@ -84,9 +84,17 @@ The system SHALL utilize the full capabilities of PP-StructureV3, extracting all
 #### Scenario: Extract table structure
 - **WHEN** PP-StructureV3 identifies a table
 - **THEN** the system SHALL extract cell content and boundaries
+- **AND** validate cell_boxes coordinates against page boundaries
+- **AND** apply fallback detection for invalid coordinates
 - **AND** preserve table HTML for structure
 - **AND** extract plain text for translation

+#### Scenario: Extract visual elements with paths
+- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
+- **THEN** the system SHALL preserve saved_path for each element
+- **AND** include image dimensions and format
+- **AND** enable image embedding in output PDF
+
 ### Requirement: Structure-Preserving Translation Foundation
 The system SHALL maintain document structure and layout information to support future translation features.

@@ -108,3 +116,26 @@ The system SHALL maintain document structure and layout information to support f
 - **AND** calculate maximum text expansion ratios
 - **AND** preserve non-translatable elements (logos, signatures)

+### Requirement: Generate UnifiedDocument from direct extraction
+The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
+
+#### Scenario: Extract tables with cell merging
+- **WHEN** direct extraction encounters a table
+- **THEN** the system SHALL use PyMuPDF find_tables() API
+- **AND** extract cell content with correct rowspan/colspan
+- **AND** preserve merged cell boundaries
+- **AND** skip placeholder cells covered by merges
+
+#### Scenario: Filter decoration images
+- **WHEN** extracting images from PDF
+- **THEN** the system SHALL filter images smaller than minimum area threshold
+- **AND** exclude covering/redaction images
+- **AND** preserve meaningful content images
+
+#### Scenario: Preserve text styling with image handling
+- **WHEN** direct extraction completes
+- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
+- **AND** preserve text styling, fonts, and exact positioning
+- **AND** extract tables with cell boundaries, content, and merge info
+- **AND** include only meaningful images in output
+
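
Review note on the "Filter decoration images" scenario above: the spec names a minimum area threshold but does not give its value or unit. The following is a minimal sketch of how such a filter might look, not the project's implementation. `ImageInfo`, `filter_meaningful_images`, and the `MIN_IMAGE_AREA` value of 200 square points are all hypothetical names and numbers chosen for illustration. Detection of covering/redaction images is not shown.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record for an image placed on a PDF page."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        # Clamp negative extents to zero so malformed boxes count as empty.
        width = max(0.0, self.x1 - self.x0)
        height = max(0.0, self.y1 - self.y0)
        return width * height


# Assumed threshold in square points; the spec does not state a value.
MIN_IMAGE_AREA = 200.0


def filter_meaningful_images(
    images: Iterable[ImageInfo], min_area: float = MIN_IMAGE_AREA
) -> List[ImageInfo]:
    """Keep images whose bounding-box area is at least min_area."""
    return [img for img in images if img.area >= min_area]


images = [
    ImageInfo("bullet-icon", 10, 10, 18, 18),     # 8 x 8 decoration
    ImageInfo("sales-chart", 50, 100, 350, 300),  # 300 x 200 content image
    ImageInfo("degenerate", 5, 5, 5, 40),         # zero-width box
]
kept = filter_meaningful_images(images)
print([img.name for img in kept])
```

Keeping the threshold as a parameter rather than a constant buried in the extractor would let it be tuned per document type, which may matter because icon sizes vary widely across PDFs.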