# Change: Fix OCR Track Cell Over-Detection

## Why

PP-StructureV3 is over-detecting table cells in OCR Track processing, incorrectly identifying regular text content (key-value pairs, bullet points, form labels) as table cells. This results in:

- 4 tables detected instead of 1 on sample document
- 105 cells detected instead of 12 (expected)
- Broken text layout and incorrect font sizing in PDF output
- Poor document reconstruction quality compared to Direct Track

Evidence from task comparison:

- Direct Track (`cfd996d9`): 1 table, 12 cells - correct representation
- OCR Track (`62de32e0`): 4 tables, 105 cells - severe over-detection

## What Changes

- Add post-detection cell validation pipeline to filter false-positive cells
- Implement table structure validation using geometric patterns
- Add text density analysis to distinguish tables from key-value text
- Apply stricter confidence thresholds for cell detection
- Add cell clustering algorithm to identify isolated false-positive cells

## Root Cause Analysis

PP-StructureV3's cell detection models over-detect cells in structured text regions.
Analysis of page 1:

| Table | Cells | Density (cells/10,000 px²) | Avg Cell Area | Status |
|-------|-------|----------------------------|---------------|--------|
| 1 | 13 | 0.87 | 11,550 px² | Normal |
| 2 | 12 | 0.44 | 22,754 px² | Normal |
| **3** | **51** | **6.22** | **1,609 px²** | **Over-detected** |
| 4 | 29 | 0.94 | 10,629 px² | Normal |

**Table 3 anomalies:**

- Cell density 7-14x higher than normal tables
- Average cell area only 7-14% of normal
- 150px height with 51 cells = ~3px per cell row (impossible)

## Proposed Solution: Post-Detection Cell Validation

Apply metric-based filtering after PP-Structure detection:

### Filter 1: Cell Density Check

- **Threshold**: Reject tables with density > 3.0 cells/10,000 px²
- **Rationale**: Normal tables have 0.4-1.0 density; over-detected have 6+

### Filter 2: Minimum Cell Area

- **Threshold**: Reject tables with average cell area < 3,000 px²
- **Rationale**: Normal cells are 10,000-25,000 px²; over-detected are ~1,600 px²

### Filter 3: Cell Height Validation

- **Threshold**: Reject if (table_height / cell_count) < 10px
- **Rationale**: Each cell row needs minimum height for readable text

### Filter 4: Reclassification

- Tables failing validation are reclassified as TEXT elements
- Original text content is preserved
- Reading order is recalculated

## Impact

- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_service.py` - Add cell validation pipeline
  - `backend/app/services/processing_orchestrator.py` - Integrate validation
  - New file: `backend/app/services/cell_validation_engine.py`

## Success Criteria

1. OCR Track cell count matches Direct Track within 10% tolerance
2. No false-positive tables detected from non-tabular content
3. Table structure maintains logical row/column alignment
4. PDF output quality comparable to Direct Track for documents with tables
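
For reference, Filters 1-4 from the Proposed Solution could be sketched roughly as follows. This is a minimal illustration, not the contents of `cell_validation_engine.py`: the names `TableMetrics`, `failed_filters`, and `classify` are hypothetical, and the sketch assumes that failing any single filter is enough to trigger reclassification. The thresholds are the ones stated in Filters 1-3. The Table 3 example approximates table area as cell count times average cell area, which reproduces the reported density of about 6.22.

```python
from dataclasses import dataclass
from typing import List

# Thresholds copied from Filters 1-3 in this proposal.
MAX_DENSITY = 3.0            # cells per 10,000 px²
MIN_AVG_CELL_AREA = 3000.0   # px²
MIN_HEIGHT_PER_CELL = 10.0   # px (table_height / cell_count)


@dataclass
class TableMetrics:
    """Hypothetical per-table measurements; not an existing project type."""
    cell_count: int
    table_area_px2: float
    table_height_px: float
    avg_cell_area_px2: float


def failed_filters(m: TableMetrics) -> List[str]:
    """Return the names of the filters (1-3) that the table fails."""
    if m.cell_count <= 0 or m.table_area_px2 <= 0:
        # Assumption: degenerate input is treated as an error.
        raise ValueError("table must have cells and a positive area")

    failures = []
    density = m.cell_count / m.table_area_px2 * 10_000
    if density > MAX_DENSITY:
        failures.append("density")
    if m.avg_cell_area_px2 < MIN_AVG_CELL_AREA:
        failures.append("cell_area")
    if m.table_height_px / m.cell_count < MIN_HEIGHT_PER_CELL:
        failures.append("cell_height")
    return failures


def classify(m: TableMetrics) -> str:
    """Filter 4: a table failing any check is reclassified as TEXT (assumed)."""
    return "TEXT" if failed_filters(m) else "TABLE"


# Table 3 from the page 1 analysis: 51 cells, 150 px tall, 1,609 px² average.
# Assumption: table area is approximated as cell_count * average cell area.
table_3 = TableMetrics(51, 51 * 1609, 150, 1609)
print(failed_filters(table_3))  # → ['density', 'cell_area', 'cell_height']
print(classify(table_3))        # → TEXT
```

Returning the list of failed filter names, rather than a single boolean, keeps the reason for each reclassification available for logging while thresholds are being tuned.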