OCR/openspec/changes/archive/2025-12-08-fix-ocr-cell-overdetection/proposal.md

Change: Fix OCR Track Cell Over-Detection

Why

PP-StructureV3 is over-detecting table cells in OCR Track processing, incorrectly identifying regular text content (key-value pairs, bullet points, form labels) as table cells. This results in:

  • 4 tables detected instead of 1 on the sample document
  • 105 cells detected instead of the expected 12
  • Broken text layout and incorrect font sizing in PDF output
  • Poor document reconstruction quality compared to Direct Track

Evidence from task comparison:

  • Direct Track (cfd996d9): 1 table, 12 cells - correct representation
  • OCR Track (62de32e0): 4 tables, 105 cells - severe over-detection

What Changes

  • Add post-detection cell validation pipeline to filter false-positive cells
  • Implement table structure validation using geometric patterns
  • Add text density analysis to distinguish tables from key-value text
  • Apply stricter confidence thresholds for cell detection
  • Add cell clustering algorithm to identify isolated false-positive cells

Root Cause Analysis

PP-StructureV3's cell detection models over-detect cells in structured text regions. Analysis of page 1:

Table   Cells   Density (cells/10000px²)   Avg Cell Area   Status
1       13      0.87                       11,550 px²      Normal
2       12      0.44                       22,754 px²      Normal
3       51      6.22                       1,609 px²       Over-detected
4       29      0.94                       10,629 px²      Normal

Table 3 anomalies:

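  • Cell density roughly 7-14x higher than the normal tables (6.22 vs 0.44-0.94)
  • Average cell area only about 7-15% of normal (1,609 px² vs 10,629-22,754 px²)
  • 150px height with 51 cells = ~3px per cell row (150 / 51 ≈ 2.9; too short for readable text)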
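
The anomaly ratios above can be re-derived from the page 1 figures. The following is a small arithmetic check, not project code; the dictionary layout and variable names are illustrative assumptions, while the numbers are copied from the analysis table.

```python
# Per-table figures copied from the page 1 analysis.
tables = {
    1: {"cells": 13, "density": 0.87, "avg_area": 11_550},
    2: {"cells": 12, "density": 0.44, "avg_area": 22_754},
    3: {"cells": 51, "density": 6.22, "avg_area": 1_609},
    4: {"cells": 29, "density": 0.94, "avg_area": 10_629},
}

suspect = tables[3]
normal = [t for key, t in tables.items() if key != 3]

# Compare Table 3 against each normal table in turn.
density_ratios = [suspect["density"] / t["density"] for t in normal]
area_ratios = [suspect["avg_area"] / t["avg_area"] for t in normal]

# Height per cell, using the 150px table height quoted above.
height_per_cell = 150 / suspect["cells"]

print(f"density: {min(density_ratios):.1f}x to {max(density_ratios):.1f}x of normal")
print(f"avg cell area: {min(area_ratios):.0%} to {max(area_ratios):.0%} of normal")
print(f"height per cell: {height_per_cell:.1f}px")
```

Table 4 sets the low end of the density range (about 6.6x) and the high end of the area range (about 15%), so the quoted ranges are approximate.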

Proposed Solution: Post-Detection Cell Validation

Apply metric-based filtering after PP-Structure detection:

Filter 1: Cell Density Check

  • Threshold: Reject tables with density > 3.0 cells/10000px²
  • Rationale: Normal tables have 0.4-1.0 density; over-detected have 6+
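
A minimal sketch of the density check, assuming the table area in px² is already known from the detected bounding box. The function and constant names are hypothetical, and the example areas are illustrative values back-computed from the reported densities, not measurements from the sample document.

```python
DENSITY_UNIT_AREA_PX2 = 10_000  # density is expressed per 10000px²
MAX_CELL_DENSITY = 3.0          # Filter 1 threshold


def cell_density(cell_count: int, table_area_px2: float) -> float:
    """Return the number of cells per 10000px² of table area."""
    if table_area_px2 <= 0:
        raise ValueError("table area must be positive")
    return cell_count * DENSITY_UNIT_AREA_PX2 / table_area_px2


def passes_density_check(cell_count: int, table_area_px2: float,
                         max_density: float = MAX_CELL_DENSITY) -> bool:
    """Keep the table unless its density is strictly above the threshold."""
    return cell_density(cell_count, table_area_px2) <= max_density


# Illustrative areas only (assumed, chosen to land near 0.87 and 6.22).
print(passes_density_check(13, 149_500))  # normal-looking table
print(passes_density_check(51, 82_000))   # dense, over-detected table
```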

Filter 2: Minimum Cell Area

  • Threshold: Reject tables with average cell area < 3,000 px²
  • Rationale: Normal cells are 10,000-25,000 px²; over-detected are ~1,600 px²
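
The area check could be sketched as follows, assuming each detected cell is available as an `(x0, y0, x1, y1)` bounding box in pixels. That representation, the helper names, and the choice to reject a table with no cells are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels, assumed layout

MIN_AVG_CELL_AREA_PX2 = 3_000  # Filter 2 threshold


def bbox_area(bbox: BBox) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = bbox
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def passes_min_area_check(cell_bboxes: Sequence[BBox],
                          min_avg_area: float = MIN_AVG_CELL_AREA_PX2) -> bool:
    """Keep the table only if its average cell area reaches the threshold."""
    if not cell_bboxes:
        return False  # assumption: a table without cells is not kept
    return mean(bbox_area(b) for b in cell_bboxes) >= min_avg_area


small_cells = [(0, 0, 40, 40), (40, 0, 80, 40)]        # 1,600 px² each
large_cells = [(0, 0, 110, 105), (110, 0, 220, 105)]   # 11,550 px² each
print(passes_min_area_check(small_cells), passes_min_area_check(large_cells))
```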

Filter 3: Cell Height Validation

  • Threshold: Reject if (table_height / cell_count) < 10px
  • Rationale: Each cell row needs minimum height for readable text
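
A sketch of the height check exactly as the threshold is written, dividing table height by cell count. The function name and the handling of an empty table are assumptions.

```python
MIN_HEIGHT_PER_CELL_PX = 10.0  # Filter 3 threshold


def passes_height_check(table_height_px: float, cell_count: int,
                        min_px: float = MIN_HEIGHT_PER_CELL_PX) -> bool:
    """Keep the table only if table_height / cell_count is at least min_px."""
    if cell_count <= 0:
        return False  # assumption: a table without cells is not kept
    return table_height_px / cell_count >= min_px


# Table 3 from the analysis: 150px tall with 51 cells, about 2.9px per cell.
print(passes_height_check(150, 51))
```

Because the divisor is the cell count rather than the row count, a wide multi-column table scores lower than its true row height. Whether to divide by an estimated row count instead is left open here.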

Filter 4: Reclassification

  • Tables failing validation are reclassified as TEXT elements
  • Original text content is preserved
  • Reading order is recalculated
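
One way the reclassification step might look, as a sketch under stated assumptions: the `Element` type, its fields, and the top-to-bottom, left-to-right reading order are hypothetical stand-ins, not the project's actual data model or ordering logic.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Element:
    """Hypothetical layout element; not the project's real type."""
    kind: str                                   # "table" or "text"
    bbox: Tuple[float, float, float, float]     # (x0, y0, x1, y1)
    text: str = ""
    cell_texts: Tuple[str, ...] = ()
    order: int = 0


def reclassify(elements: List[Element],
               is_valid_table: Callable[[Element], bool]) -> List[Element]:
    """Turn failing tables into text elements, then renumber reading order."""
    result = []
    for el in elements:
        if el.kind == "table" and not is_valid_table(el):
            # Preserve the original text by joining the non-empty cell texts.
            joined = "\n".join(t for t in el.cell_texts if t)
            el = replace(el, kind="text", text=joined, cell_texts=())
        result.append(el)
    # Assumed reading order: top-to-bottom, then left-to-right.
    result.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return [replace(e, order=i) for i, e in enumerate(result)]


page = [
    Element("table", (0, 100, 500, 250), cell_texts=("Name:", "Alice")),
    Element("text", (0, 0, 500, 40), text="Header"),
]
fixed = reclassify(page, is_valid_table=lambda el: False)
print([(e.order, e.kind, e.text) for e in fixed])
```

The validity predicate is passed in so that the density, area, and height checks can be combined however the pipeline decides.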

Impact

  • Affected specs: ocr-processing
  • Affected code:
    • backend/app/services/ocr_service.py - Add cell validation pipeline
    • backend/app/services/processing_orchestrator.py - Integrate validation
    • New file: backend/app/services/cell_validation_engine.py

Success Criteria

  1. OCR Track cell count matches Direct Track within 10% tolerance
  2. No false-positive tables detected from non-tabular content
  3. Table structure maintains logical row/column alignment
  4. PDF output quality comparable to Direct Track for documents with tables
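
Criterion 1 can be expressed as a relative-difference check. This is a minimal sketch that assumes the Direct Track count is the reference value; the function name is hypothetical.

```python
def within_tolerance(ocr_count: int, direct_count: int,
                     tolerance: float = 0.10) -> bool:
    """True if the OCR Track count is within `tolerance` of the Direct Track count."""
    if direct_count == 0:
        return ocr_count == 0  # assumption: no cells expected means none allowed
    return abs(ocr_count - direct_count) / direct_count <= tolerance


# Counts from the task comparison: Direct Track 12 cells, OCR Track 105 cells.
print(within_tolerance(105, 12))
# A count of 13 against 12 is about 8.3% off and would pass.
print(within_tolerance(13, 12))
```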