chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 11:55:39 +08:00
parent eff9b0bcd5
commit 940a406dce
58 changed files with 8226 additions and 175 deletions
--- a/openspec/specs/document-processing/spec.md
+++ b/openspec/specs/document-processing/spec.md
@@ -67,7 +67,7 @@ The system SHALL use a standardized UnifiedDocument model as the common output f
 - **AND** support identical downstream operations (PDF generation, translation)

 ### Requirement: Enhanced OCR with Full PP-StructureV3
-The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list.
+The system SHALL utilize the full capabilities of PP-StructureV3, extracting all 23 element types from parsing_res_list, with proper handling of visual elements and table coordinates.

 #### Scenario: Extract comprehensive document structure
 - **WHEN** processing through OCR track
@@ -84,9 +84,17 @@ The system SHALL utilize the full capabilities of PP-StructureV3, extracting all
 #### Scenario: Extract table structure
 - **WHEN** PP-StructureV3 identifies a table
 - **THEN** the system SHALL extract cell content and boundaries
+- **AND** validate cell_boxes coordinates against page boundaries
+- **AND** apply fallback detection for invalid coordinates
 - **AND** preserve table HTML for structure
 - **AND** extract plain text for translation

+#### Scenario: Extract visual elements with paths
+- **WHEN** PP-StructureV3 identifies visual elements (IMAGE, FIGURE, CHART, DIAGRAM)
+- **THEN** the system SHALL preserve saved_path for each element
+- **AND** include image dimensions and format
+- **AND** enable image embedding in output PDF
+
 ### Requirement: Structure-Preserving Translation Foundation
 The system SHALL maintain document structure and layout information to support future translation features.

@@ -108,3 +116,26 @@ The system SHALL maintain document structure and layout information to support f
 - **AND** calculate maximum text expansion ratios
 - **AND** preserve non-translatable elements (logos, signatures)

+### Requirement: Generate UnifiedDocument from direct extraction
+The system SHALL convert PyMuPDF results to UnifiedDocument with correct table cell merging.
+
+#### Scenario: Extract tables with cell merging
+- **WHEN** direct extraction encounters a table
+- **THEN** the system SHALL use PyMuPDF find_tables() API
+- **AND** extract cell content with correct rowspan/colspan
+- **AND** preserve merged cell boundaries
+- **AND** skip placeholder cells covered by merges
+
+#### Scenario: Filter decoration images
+- **WHEN** extracting images from PDF
+- **THEN** the system SHALL filter images smaller than minimum area threshold
+- **AND** exclude covering/redaction images
+- **AND** preserve meaningful content images
+
+#### Scenario: Preserve text styling with image handling
+- **WHEN** direct extraction completes
+- **THEN** the system SHALL convert PyMuPDF results to UnifiedDocument
+- **AND** preserve text styling, fonts, and exact positioning
+- **AND** extract tables with cell boundaries, content, and merge info
+- **AND** include only meaningful images in output
+
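Note on the "Filter decoration images" scenario above: the spec gives no concrete area threshold and does not say how covering/redaction images are detected. The sketch below is one possible reading, not the project's implementation. `ImageInfo`, `is_covering`, `filter_meaningful_images`, and the `min_area=100.0` value are all hypothetical names and numbers chosen for illustration, and the sketch deliberately avoids calling PyMuPDF.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageInfo:
    """Hypothetical record for one extracted image (not a PyMuPDF type)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float
    # Assumption: an earlier step has already marked covering/redaction images.
    is_covering: bool = False

    @property
    def area(self) -> float:
        # Degenerate or inverted boxes count as zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def filter_meaningful_images(images: List[ImageInfo], min_area: float) -> List[ImageInfo]:
    """Drop images below min_area and images flagged as covering."""
    return [img for img in images if img.area >= min_area and not img.is_covering]


images = [
    ImageInfo("bullet_dot", 0, 0, 5, 5),                        # 25 px², decoration
    ImageInfo("chart", 50, 50, 350, 250),                       # 60,000 px², content
    ImageInfo("redaction", 10, 10, 300, 40, is_covering=True),  # large but covering
]

# The threshold here is illustrative only; the spec leaves the value open.
kept = filter_meaningful_images(images, min_area=100.0)
print([img.name for img in kept])
```

Taking the threshold as a required argument keeps the unspecified value out of the function itself, so a caller has to choose it explicitly.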
--- a/openspec/specs/ocr-processing/spec.md
+++ b/openspec/specs/ocr-processing/spec.md
@@ -195,3 +195,66 @@ The system SHALL provide documentation for cleaning up unused model caches to op
 - **THEN** the documentation SHALL explain how to delete unused cached models from `~/.paddlex/official_models/`
 - **AND** list which model directories can be safely removed

+### Requirement: Cell Over-Detection Filtering
+
+The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.
+
+#### Scenario: Cell density exceeds threshold
+- **GIVEN** a table detected by PP-StructureV3 with cell_boxes
+- **WHEN** cell density exceeds 3.0 cells per 10,000 px²
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Average cell area below threshold
+- **GIVEN** a table detected by PP-StructureV3
+- **WHEN** average cell area is less than 3,000 px²
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Cell height too small
+- **GIVEN** a table with height H and N cells
+- **WHEN** (H / N) is less than 10 pixels
+- **THEN** the system SHALL flag the table as over-detected
+- **AND** reclassify the table as a TEXT element
+
+#### Scenario: Valid tables are preserved
+- **GIVEN** a table with normal metrics (density < 3.0, avg area > 3000, height/N > 10)
+- **WHEN** validation is applied
+- **THEN** the table SHALL be preserved unchanged
+- **AND** all cell_boxes SHALL be retained
+
+### Requirement: Table-to-Text Reclassification
+
+The system SHALL convert over-detected tables to TEXT elements while preserving content.
+
+#### Scenario: Table content is preserved
+- **GIVEN** a table flagged for reclassification
+- **WHEN** converting to TEXT element
+- **THEN** the system SHALL extract text content from table HTML
+- **AND** preserve the original bounding box
+- **AND** set element type to TEXT
+
+#### Scenario: Reading order is recalculated
+- **GIVEN** tables have been reclassified as TEXT
+- **WHEN** assembling the final page structure
+- **THEN** the system SHALL recalculate reading order
+- **AND** sort elements by y0 then x0 coordinates
+
+### Requirement: Validation Configuration
+
+The system SHALL provide configurable thresholds for cell validation.
+
+#### Scenario: Default thresholds are applied
+- **GIVEN** no custom configuration is provided
+- **WHEN** validating tables
+- **THEN** the system SHALL use default thresholds:
+  - max_cell_density: 3.0 cells/10000px²
+  - min_avg_cell_area: 3000 px²
+  - min_cell_height: 10 px
+
+#### Scenario: Custom thresholds can be configured
+- **GIVEN** custom validation thresholds in configuration
+- **WHEN** validating tables
+- **THEN** the system SHALL use the custom values
+- **AND** apply them consistently to all pages
+
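Note on the three requirements added to ocr-processing/spec.md: the checks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in this commit. The default thresholds come from the "Default thresholds are applied" scenario. Everything else is assumed: that density is measured against the table's bounding-box area, that elements are dicts with `type`, `bbox`, and `html` keys, and all function and class names (`ValidationConfig`, `is_over_detected`, `reclassify_as_text`, `reading_order`).

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class ValidationConfig:
    # Defaults from the spec; override any field for custom thresholds.
    max_cell_density: float = 3.0      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0  # px²
    min_cell_height: float = 10.0      # px


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def is_over_detected(table_bbox: Box, cell_boxes: List[Box],
                     cfg: ValidationConfig = ValidationConfig()) -> bool:
    """Return True if any of the three heuristics flags the table."""
    n = len(cell_boxes)
    table_area = box_area(table_bbox)
    if n == 0 or table_area == 0:
        return False  # assumption: tables without measurable cells are left alone
    density = n / (table_area / 10_000)
    avg_area = sum(box_area(b) for b in cell_boxes) / n
    height_per_cell = (table_bbox[3] - table_bbox[1]) / n  # H / N from the spec
    return (density > cfg.max_cell_density
            or avg_area < cfg.min_avg_cell_area
            or height_per_cell < cfg.min_cell_height)


class _TextCollector(HTMLParser):
    """Collects non-empty text nodes from table HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.parts.append(data.strip())


def reclassify_as_text(element: dict) -> dict:
    """Convert a flagged table to a TEXT element, keeping its bounding box."""
    collector = _TextCollector()
    collector.feed(element["html"])
    collector.close()
    return {"type": "TEXT", "bbox": element["bbox"], "text": " ".join(collector.parts)}


def reading_order(elements: List[dict]) -> List[dict]:
    """Sort elements by y0, then x0."""
    return sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))


# A 4x3 grid of 250x200 px cells inside a 1000x600 px table: normal metrics.
normal_cells = [(c * 250.0, r * 200.0, (c + 1) * 250.0, (r + 1) * 200.0)
                for r in range(3) for c in range(4)]
# A 10x5 grid of 20x20 px cells inside a 200x100 px table: far too dense.
dense_cells = [(c * 20.0, r * 20.0, (c + 1) * 20.0, (r + 1) * 20.0)
               for r in range(5) for c in range(10)]

normal_flag = is_over_detected((0.0, 0.0, 1000.0, 600.0), normal_cells)
dense_flag = is_over_detected((0.0, 0.0, 200.0, 100.0), dense_cells)

table = {"type": "TABLE", "bbox": (10.0, 200.0, 210.0, 300.0),
         "html": "<table><tr><td>A</td><td>B</td></tr></table>"}
text_element = reclassify_as_text(table)
ordered = reading_order([
    text_element,
    {"type": "TEXT", "bbox": (300.0, 50.0, 400.0, 80.0)},
    {"type": "TEXT", "bbox": (10.0, 50.0, 200.0, 80.0)},
])
```

Custom thresholds fit the same shape, for example `is_over_detected(bbox, cells, ValidationConfig(min_cell_height=8.0))`. Passing one config object for the whole document is one way to satisfy the "apply them consistently to all pages" clause.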