OCR/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md
egg 940a406dce chore: backup before code cleanup
2025-12-11 11:55:39 +08:00

## ADDED Requirements
### Requirement: Cell Over-Detection Filtering
The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.
#### Scenario: Cell density exceeds threshold
- **GIVEN** a table detected by PP-StructureV3 with cell_boxes
- **WHEN** cell density exceeds 3.0 cells per 10,000 px²
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element
#### Scenario: Average cell area below threshold
- **GIVEN** a table detected by PP-StructureV3
- **WHEN** average cell area is less than 3,000 px²
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element
#### Scenario: Cell height too small
- **GIVEN** a table of height H pixels containing N cells
- **WHEN** the height per cell (H / N) is less than 10 px
- **THEN** the system SHALL flag the table as over-detected
- **AND** reclassify the table as a TEXT element
#### Scenario: Valid tables are preserved
- **GIVEN** a table whose metrics are all within thresholds (density ≤ 3.0 cells per 10,000 px², average cell area ≥ 3,000 px², H / N ≥ 10 px)
- **WHEN** validation is applied
- **THEN** the table SHALL be preserved unchanged
- **AND** all cell_boxes SHALL be retained
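
The four scenarios above can be sketched as a single check. This is a minimal illustration, not the project's implementation: the function name `over_detection_reason`, the `(x0, y0, x1, y1)` box layout, and the decision to leave empty or zero-area tables untouched are all assumptions.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, y0, x1, y1) in pixels


def over_detection_reason(
    table_bbox: Box,
    cell_boxes: Sequence[Box],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px
) -> Optional[str]:
    """Return the name of the first failed check, or None if the table is valid."""
    n = len(cell_boxes)
    x0, y0, x1, y1 = table_bbox
    table_height = max(y1 - y0, 0.0)
    table_area = max(x1 - x0, 0.0) * table_height
    if n == 0 or table_area == 0:
        # Assumption: nothing to measure, so the table is left unchanged.
        return None

    density = n / table_area * 10_000
    avg_cell_area = sum((cx1 - cx0) * (cy1 - cy0) for cx0, cy0, cx1, cy1 in cell_boxes) / n
    height_per_cell = table_height / n

    if density > max_cell_density:
        return "cell_density"
    if avg_cell_area < min_avg_cell_area:
        return "avg_cell_area"
    if height_per_cell < min_cell_height:
        return "cell_height"
    return None
```

A non-`None` result means the table would be flagged as over-detected and handed to reclassification; `None` means the table and all of its cell_boxes are kept as they are.
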
### Requirement: Table-to-Text Reclassification
The system SHALL convert over-detected tables to TEXT elements while preserving content.
#### Scenario: Table content is preserved
- **GIVEN** a table flagged for reclassification
- **WHEN** converting to TEXT element
- **THEN** the system SHALL extract text content from table HTML
- **AND** preserve the original bounding box
- **AND** set element type to TEXT
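
One way to satisfy this scenario is to strip the table markup with the standard-library HTML parser. The sketch below assumes elements are plain dictionaries with hypothetical keys `type`, `bbox`, `html`, and `text`; the actual element model in the codebase may differ.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class _TableTextExtractor(HTMLParser):
    """Collects cell text, grouped by table row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            # Collapse internal whitespace and skip empty cells.
            text = " ".join("".join(self._cell).split())
            if not self.rows:
                self.rows.append([])
            if text:
                self.rows[-1].append(text)
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)


def table_html_to_text(html: str) -> str:
    """Return one line of text per non-empty table row."""
    parser = _TableTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(" ".join(row) for row in parser.rows if row)


def reclassify_as_text(table: Dict[str, Any]) -> Dict[str, Any]:
    """Build a TEXT element that keeps the table's content and bounding box."""
    return {
        "type": "TEXT",
        "bbox": table["bbox"],  # original bounding box is preserved
        "text": table_html_to_text(table.get("html", "")),
    }
```

Joining cells with spaces and rows with newlines is a design choice made here for readability; the requirement only states that the text content must survive.
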
#### Scenario: Reading order is recalculated
- **GIVEN** tables have been reclassified as TEXT
- **WHEN** assembling the final page structure
- **THEN** the system SHALL recalculate reading order
- **AND** sort elements by y0 then x0 coordinates
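
The sort described above is a two-key ordering on the top-left corner of each bounding box. A minimal sketch, assuming dictionary elements with a `bbox` of `(x0, y0, x1, y1)` and a hypothetical `reading_order` field for the result:

```python
from typing import Any, Dict, List


def recalculate_reading_order(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by y0, then x0, and number the elements in that order."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Return copies so the caller's elements are not mutated.
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]
```

Because `sorted` is stable, elements with identical `(y0, x0)` keep their incoming relative order.
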
### Requirement: Validation Configuration
The system SHALL provide configurable thresholds for cell validation.
#### Scenario: Default thresholds are applied
- **GIVEN** no custom configuration is provided
- **WHEN** validating tables
- **THEN** the system SHALL use default thresholds:
  - max_cell_density: 3.0 cells/10000px²
  - min_avg_cell_area: 3000 px²
  - min_cell_height: 10 px
#### Scenario: Custom thresholds can be configured
- **GIVEN** custom validation thresholds in configuration
- **WHEN** validating tables
- **THEN** the system SHALL use the custom values
- **AND** apply them consistently to all pages
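
The configuration requirement could be modelled as an immutable settings object, so that one set of thresholds is shared across every page of a document. The class name `CellValidationConfig` and the loader `load_validation_config` are hypothetical; only the three field names and their defaults come from the scenarios above.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CellValidationConfig:
    max_cell_density: float = 3.0      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0  # px²
    min_cell_height: float = 10.0      # px


def load_validation_config(
    overrides: Optional[Mapping[str, Any]] = None,
) -> CellValidationConfig:
    """Return the defaults, with any supplied thresholds replacing them."""
    if not overrides:
        return CellValidationConfig()
    known = {f.name for f in fields(CellValidationConfig)}
    unknown = set(overrides) - known
    if unknown:
        # Assumption: misspelled keys should fail loudly rather than be ignored.
        raise ValueError(f"Unknown validation thresholds: {sorted(unknown)}")
    return CellValidationConfig(**{k: float(v) for k, v in overrides.items()})
```

Freezing the dataclass prevents thresholds from being changed partway through a run, which is one simple way to meet the "apply them consistently to all pages" clause.
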