OCR/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/specs/ocr-processing/spec.md

ADDED Requirements

Requirement: Cell Over-Detection Filtering

The system SHALL validate PP-StructureV3 table detections using metric-based heuristics to filter over-detected cells.

Scenario: Cell density exceeds threshold

  • GIVEN a table detected by PP-StructureV3 with cell_boxes
  • WHEN cell density exceeds 3.0 cells per 10,000 px²
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Average cell area below threshold

  • GIVEN a table detected by PP-StructureV3
  • WHEN average cell area is less than 3,000 px²
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Cell height too small

  • GIVEN a table of height H (in pixels) containing N cells
  • WHEN the ratio H / N is less than 10 px
  • THEN the system SHALL flag the table as over-detected
  • AND reclassify the table as a TEXT element

Scenario: Valid tables are preserved

  • GIVEN a table whose metrics are within all thresholds (density ≤ 3.0 cells per 10,000 px², average cell area ≥ 3,000 px², table height / cell count ≥ 10 px)
  • WHEN validation is applied
  • THEN the table SHALL be preserved unchanged
  • AND all cell_boxes SHALL be retained
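
The four scenarios above can be sketched as a single check. This is a minimal illustration, not the project's implementation: the function name `is_over_detected`, the `(x0, y0, x1, y1)` box layout, and the exact metric definitions (density as cell count per 10,000 px² of table area, average area as the mean of the individual cell box areas) are assumptions made for this example.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x0, y0, x1, y1) in pixels.
Box = Tuple[float, float, float, float]


def _area(box: Box) -> float:
    """Area of a box in px², clamped at zero for degenerate boxes."""
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0.0) * max(y1 - y0, 0.0)


def is_over_detected(
    table_bbox: Box,
    cell_boxes: Sequence[Box],
    max_cell_density: float = 3.0,      # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0,  # px²
    min_cell_height: float = 10.0,      # px, compared against H / N
) -> bool:
    """Return True if any of the three over-detection heuristics fires."""
    n = len(cell_boxes)
    table_area = _area(table_bbox)
    if n == 0 or table_area == 0.0:
        # Assumption: a table without cells or area is left to other checks.
        return False

    density = n / table_area * 10_000
    avg_cell_area = sum(_area(b) for b in cell_boxes) / n
    height_per_cell = (table_bbox[3] - table_bbox[1]) / n

    return (
        density > max_cell_density
        or avg_cell_area < min_avg_cell_area
        or height_per_cell < min_cell_height
    )
```

A table for which this returns False would be preserved unchanged with all of its cell_boxes; one for which it returns True would be handed to the reclassification step.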

Requirement: Table-to-Text Reclassification

The system SHALL convert over-detected tables to TEXT elements while preserving content.

Scenario: Table content is preserved

  • GIVEN a table flagged for reclassification
  • WHEN converting to TEXT element
  • THEN the system SHALL extract text content from table HTML
  • AND preserve the original bounding box
  • AND set element type to TEXT
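
One way to picture this conversion, using only the Python standard library: the sketch below walks the table HTML, joins the text of each row, and returns a TEXT element with the original bounding box. The element layout (a dict with `html`, `bbox`, `type`, and `text` keys) and the function name `table_to_text_element` are hypothetical. The sketch also assumes that only text inside `td`/`th` cells needs to be kept.

```python
from html.parser import HTMLParser


class _TableTextExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = []     # list of rows, each a list of cell strings
        self._cell = None  # text chunks of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells without an enclosing <tr>
                self.rows.append([])
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell is one clean token run.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_text_element(table: dict) -> dict:
    """Build a TEXT element from a table element, keeping its bounding box."""
    parser = _TableTextExtractor()
    parser.feed(table.get("html", ""))
    parser.close()
    lines = [" ".join(cell for cell in row if cell) for row in parser.rows]
    text = "\n".join(line for line in lines if line)
    return {"type": "TEXT", "bbox": table["bbox"], "text": text}
```

Each table row becomes one line of text here; a real pipeline might prefer a different separator or keep the source HTML alongside the extracted text.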

Scenario: Reading order is recalculated

  • GIVEN tables have been reclassified as TEXT
  • WHEN assembling the final page structure
  • THEN the system SHALL recalculate reading order
  • AND sort elements by y0 then x0 coordinates
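
The sort described above amounts to ordering by the top edge, then the left edge, of each bounding box. A minimal sketch, assuming each element is a dict with a `bbox` of `(x0, y0, x1, y1)` and that the result is stored in a hypothetical `reading_order` field:

```python
def recalculate_reading_order(elements):
    """Return new element dicts sorted by (y0, x0) with a fresh reading_order index."""
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    # Copy each element rather than mutating the caller's dicts.
    return [{**e, "reading_order": i} for i, e in enumerate(ordered)]
```

Because Python's sort is stable, elements with identical y0 and x0 keep their original relative order.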

Requirement: Validation Configuration

The system SHALL provide configurable thresholds for cell validation.

Scenario: Default thresholds are applied

  • GIVEN no custom configuration is provided
  • WHEN validating tables
  • THEN the system SHALL use default thresholds:
    • max_cell_density: 3.0 cells per 10,000 px²
    • min_avg_cell_area: 3,000 px²
    • min_cell_height: 10 px

Scenario: Custom thresholds can be configured

  • GIVEN custom validation thresholds in configuration
  • WHEN validating tables
  • THEN the system SHALL use the custom values
  • AND apply them consistently to all pages
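
The two configuration scenarios can be illustrated with a small immutable settings object. The field names and default values come from the list above; the class name `CellValidationConfig`, the loader `load_validation_config`, and the choice to reject unknown keys are assumptions for this sketch.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CellValidationConfig:
    """Thresholds for cell over-detection checks; defaults follow the spec."""

    max_cell_density: float = 3.0       # cells per 10,000 px²
    min_avg_cell_area: float = 3000.0   # px²
    min_cell_height: float = 10.0       # px


def load_validation_config(
    overrides: Optional[Mapping[str, Any]] = None,
) -> CellValidationConfig:
    """Merge optional custom thresholds over the defaults."""
    overrides = dict(overrides or {})
    known = {f.name for f in fields(CellValidationConfig)}
    unknown = set(overrides) - known
    if unknown:
        # Assumption: fail loudly on misspelled keys instead of ignoring them.
        raise ValueError(f"unknown validation thresholds: {sorted(unknown)}")
    return CellValidationConfig(**overrides)
```

Creating one frozen instance per document and passing it to every page's validation is one simple way to keep the thresholds consistent across pages.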