Backup commit before executing remove-unused-code proposal. This includes all pending changes and new features. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
ADDED Requirements
Requirement: Cell Over-Detection Filtering
The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.
Scenario: Cell density exceeds threshold
- GIVEN a table detected by PP-StructureV3 with cell_boxes
- WHEN cell density exceeds 3.0 cells per 10,000 px²
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
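The density check above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name `cell_density` and the `(x0, y0, x1, y1)` box layout are assumptions, and density is taken to mean the cell count divided by the table's bounding-box area in units of 10,000 px².

```python
def cell_density(table_bbox, cell_boxes):
    """Return cells per 10,000 px² of table area.

    Assumes table_bbox is (x0, y0, x1, y1) in pixels (hypothetical layout).
    """
    x0, y0, x1, y1 = table_bbox
    area = max(x1 - x0, 0) * max(y1 - y0, 0)
    if area == 0:
        # A degenerate table with cells is treated as infinitely dense.
        return float("inf") if cell_boxes else 0.0
    return len(cell_boxes) / (area / 10_000)


# A 200 x 100 px table (20,000 px²) holding 8 cells has density 4.0,
# which exceeds the 3.0 threshold and would be flagged.
density = cell_density((0, 0, 200, 100), [(0, 0, 50, 50)] * 8)
is_over_detected = density > 3.0
```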
Scenario: Average cell area below threshold
- GIVEN a table detected by PP-StructureV3
- WHEN average cell area is less than 3,000 px²
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
Scenario: Cell height too small
- GIVEN a table with height H and N cells
- WHEN the height-to-cell-count ratio (H / N) is less than 10 px
- THEN the system SHALL flag the table as over-detected
- AND reclassify the table as a TEXT element
Scenario: Valid tables are preserved
- GIVEN a table with normal metrics (density ≤ 3.0 cells per 10,000 px², average cell area ≥ 3,000 px², H / N ≥ 10 px)
- WHEN validation is applied
- THEN the table SHALL be preserved unchanged
- AND all cell_boxes SHALL be retained
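One way the four scenarios of this requirement might fit together is sketched below. The function name `over_detection_reasons`, the reason labels, and the `(x0, y0, x1, y1)` box layout are hypothetical, and average cell area is assumed to be the mean of the individual cell box areas. An empty result means the table is valid and is left untouched.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def over_detection_reasons(
    table_bbox: Box,
    cell_boxes: List[Box],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px
) -> List[str]:
    """Return the names of the heuristics the table fails (empty if valid)."""
    n = len(cell_boxes)
    if n == 0:
        return []  # assumption: a table without cell_boxes is not validated

    table_area = box_area(table_bbox)
    table_height = max(table_bbox[3] - table_bbox[1], 0.0)
    reasons = []

    density = n / (table_area / 10_000) if table_area > 0 else float("inf")
    if density > max_cell_density:
        reasons.append("cell_density")

    avg_area = sum(box_area(b) for b in cell_boxes) / n
    if avg_area < min_avg_cell_area:
        reasons.append("avg_cell_area")

    if table_height / n < min_cell_height:
        reasons.append("cell_height")

    return reasons


# 400 x 300 px table with six 200 x 100 px cells: passes every check.
valid = over_detection_reasons((0, 0, 400, 300), [(0, 0, 200, 100)] * 6)

# 200 x 100 px table with forty 20 x 25 px cells: fails every check.
noisy = over_detection_reasons((0, 0, 200, 100), [(0, 0, 20, 25)] * 40)
```

A caller would reclassify the table as TEXT when the list is non-empty, and otherwise keep the table and its cell_boxes as they are.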
Requirement: Table-to-Text Reclassification
The system SHALL convert over-detected tables to TEXT elements while preserving content.
Scenario: Table content is preserved
- GIVEN a table flagged for reclassification
- WHEN converting to TEXT element
- THEN the system SHALL extract text content from table HTML
- AND preserve the original bounding box
- AND set element type to TEXT
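A rough sketch of the conversion step, using only the standard library's `html.parser`. The element is modeled here as a plain dict with hypothetical keys `type`, `bbox`, `html`, and `text`; the real element model may differ. Joining cells with spaces and rows with newlines is also an assumption, since the requirement only states that text content is extracted.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text chunks while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            if not self.rows:
                self.rows.append([])  # tolerate cells outside any <tr>
            self.rows[-1].append(text)
            self._cell = None


def table_to_text_element(table):
    """Build a TEXT element that keeps the table's bounding box."""
    parser = CellTextExtractor()
    parser.feed(table.get("html", ""))
    parser.close()
    lines = [" ".join(cell for cell in row if cell) for row in parser.rows]
    text = "\n".join(line for line in lines if line)
    return {"type": "TEXT", "bbox": table["bbox"], "text": text}


table = {
    "type": "TABLE",
    "bbox": (10, 20, 210, 120),
    "html": "<table><tr><td>Name</td><td>Qty</td></tr>"
            "<tr><td>Bolt</td><td>4</td></tr></table>",
}
text_element = table_to_text_element(table)
```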
Scenario: Reading order is recalculated
- GIVEN tables have been reclassified as TEXT
- WHEN assembling the final page structure
- THEN the system SHALL recalculate reading order
- AND sort elements by y0 then x0 coordinates
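The sort described above could look like the sketch below. It assumes each element is a dict whose `bbox` is `(x0, y0, x1, y1)`, that both keys sort in ascending order, and that the position is stored under a hypothetical `reading_order` key.

```python
def recalculate_reading_order(elements):
    """Return copies of the elements sorted by y0, then x0, with an index."""
    ordered = sorted(elements, key=lambda el: (el["bbox"][1], el["bbox"][0]))
    return [{**el, "reading_order": i} for i, el in enumerate(ordered)]


page = [
    {"type": "TEXT", "bbox": (50, 100, 90, 120)},
    {"type": "TEXT", "bbox": (10, 100, 40, 120)},
    {"type": "TITLE", "bbox": (0, 20, 200, 40)},
]
ordered_page = recalculate_reading_order(page)
```

An exact y0 comparison treats elements that are offset by a single pixel as belonging to different rows; whether a tolerance is needed depends on the layout data and is not specified here.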
Requirement: Validation Configuration
The system SHALL provide configurable thresholds for cell validation.
Scenario: Default thresholds are applied
- GIVEN no custom configuration is provided
- WHEN validating tables
- THEN the system SHALL use default thresholds:
  - max_cell_density: 3.0 cells per 10,000 px²
  - min_avg_cell_area: 3,000 px²
  - min_cell_height: 10 px
Scenario: Custom thresholds can be configured
- GIVEN custom validation thresholds in configuration
- WHEN validating tables
- THEN the system SHALL use the custom values
- AND apply them consistently to all pages
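A possible shape for the threshold configuration is sketched below. The field names and defaults come from the scenarios above; the class name `CellValidationConfig`, the loader `load_validation_config`, and the dict-based override format are assumptions. The config is frozen so that one instance can be shared across all pages without being changed mid-run.

```python
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class CellValidationConfig:
    max_cell_density: float = 3.0      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0  # px²
    min_cell_height: float = 10.0      # px


def load_validation_config(
    overrides: Optional[Mapping[str, float]] = None,
) -> CellValidationConfig:
    """Apply custom thresholds on top of the defaults, rejecting unknown keys."""
    overrides = dict(overrides or {})
    known = {f.name for f in fields(CellValidationConfig)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise ValueError(f"unknown validation thresholds: {unknown}")
    return CellValidationConfig(**overrides)


default_config = load_validation_config()
custom_config = load_validation_config({"min_cell_height": 8.0})
```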