chore: backup before code cleanup
Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -0,0 +1,73 @@
# Change: Fix OCR Track Cell Over-Detection

## Why

PP-StructureV3 over-detects table cells in OCR Track processing: it misidentifies regular text content (key-value pairs, bullet points, form labels) as table cells. On the sample document this results in:
- 4 tables detected instead of 1
- 105 cells detected instead of the expected 12
- Broken text layout and incorrect font sizing in PDF output
- Poor document reconstruction quality compared to Direct Track

Evidence from task comparison:
- Direct Track (`cfd996d9`): 1 table, 12 cells (correct representation)
- OCR Track (`62de32e0`): 4 tables, 105 cells (severe over-detection)

## What Changes

- Add post-detection cell validation pipeline to filter false-positive cells
- Implement table structure validation using geometric patterns
- Add text density analysis to distinguish tables from key-value text
- Apply stricter confidence thresholds for cell detection
- Add cell clustering algorithm to identify isolated false-positive cells
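
The last bullet does not specify a clustering algorithm. As a minimal sketch of the idea, assuming cells arrive as `(x0, y0, x1, y1)` pixel boxes, a simple neighbour test can stand in for clustering: a cell counts as isolated when no other cell lies within a small gap of it. The names `boxes_touch`, `find_isolated_cells` and the 5px `gap` default are hypothetical, not part of PP-StructureV3 or the existing code.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def boxes_touch(a: Box, b: Box, gap: float) -> bool:
    """True if boxes a and b overlap or lie within `gap` pixels of each other."""
    return (
        a[0] - gap <= b[2] and b[0] - gap <= a[2]
        and a[1] - gap <= b[3] and b[1] - gap <= a[3]
    )


def find_isolated_cells(cells: List[Box], gap: float = 5.0) -> List[int]:
    """Return indices of cells that have no neighbour within `gap` pixels."""
    isolated = []
    for i, a in enumerate(cells):
        has_neighbour = any(
            boxes_touch(a, b, gap) for j, b in enumerate(cells) if j != i
        )
        if not has_neighbour:
            isolated.append(i)
    return isolated


# Three adjacent cells forming a grid corner, plus one stray detection.
cells = [(0, 0, 100, 40), (100, 0, 200, 40), (0, 40, 100, 80), (400, 300, 450, 320)]
print(find_isolated_cells(cells))  # [3]
```

The pairwise scan is quadratic in the number of cells, which should be acceptable at the cell counts reported here (about 100 per page); a real implementation might prefer a spatial index or a proper clustering method.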

## Root Cause Analysis

PP-StructureV3's cell detection models over-detect cells in structured text regions. Per-table metrics for page 1:

| Table | Cells | Density (cells/10000px²) | Avg Cell Area | Status |
|-------|-------|--------------------------|---------------|--------|
| 1 | 13 | 0.87 | 11,550 px² | Normal |
| 2 | 12 | 0.44 | 22,754 px² | Normal |
| **3** | **51** | **6.22** | **1,609 px²** | **Over-detected** |
| 4 | 29 | 0.94 | 10,629 px² | Normal |

**Table 3 anomalies:**
- Cell density roughly 7-14x higher than the normal tables
- Average cell area only 7-15% of the normal tables
- 150px height with 51 cells = ~3px of height per cell (too small to hold a row of text)
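
The ratios behind these anomalies can be reproduced directly from the table. The sketch below only re-does the arithmetic on the values listed above; the names `TABLES`, `SUSPECT` and `compare_to_normal` are illustrative.

```python
# (cell count, density in cells/10000px², average cell area in px²) per table,
# copied from the analysis table above.
TABLES = {
    1: (13, 0.87, 11_550),
    2: (12, 0.44, 22_754),
    3: (51, 6.22, 1_609),
    4: (29, 0.94, 10_629),
}
SUSPECT = 3


def compare_to_normal(tables, suspect):
    """Return {table_id: (density ratio, area ratio)} of the suspect vs each other table."""
    _, s_density, s_area = tables[suspect]
    ratios = {}
    for table_id, (_, density, area) in tables.items():
        if table_id == suspect:
            continue
        ratios[table_id] = (s_density / density, s_area / area)
    return ratios


for table_id, (density_ratio, area_ratio) in compare_to_normal(TABLES, SUSPECT).items():
    print(f"vs table {table_id}: density x{density_ratio:.1f}, area {area_ratio:.1%}")
# vs table 1: density x7.1, area 13.9%
# vs table 2: density x14.1, area 7.1%
# vs table 4: density x6.6, area 15.1%

print(f"height per cell: {150 / 51:.1f}px")  # height per cell: 2.9px
```

In this data the two metrics are close to reciprocal (10,000 / 1,609 ≈ 6.22), which suggests the detected cells tile each table almost completely.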

## Proposed Solution: Post-Detection Cell Validation

Apply metric-based filtering after PP-StructureV3 detection:

### Filter 1: Cell Density Check
- **Threshold**: Reject tables with density > 3.0 cells/10000px²
- **Rationale**: Normal tables have a density of 0.4-1.0 cells/10000px²; the over-detected table has 6.22
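
A minimal sketch of this check, assuming the table's bounding-box width and height are available in pixels. The function names are hypothetical, and the 547px width in the example is an assumption chosen so that a 150px-high table with 51 cells reproduces the reported density.

```python
MAX_CELL_DENSITY = 3.0  # cells per 10,000 px², threshold from Filter 1


def cell_density(cell_count: int, table_width: float, table_height: float) -> float:
    """Number of detected cells per 10,000 px² of table area."""
    area = table_width * table_height
    if area <= 0:
        raise ValueError("table area must be positive")
    return cell_count / area * 10_000


def passes_density_check(
    cell_count: int,
    table_width: float,
    table_height: float,
    max_density: float = MAX_CELL_DENSITY,
) -> bool:
    """True if the table is kept; False if its density exceeds the threshold."""
    return cell_density(cell_count, table_width, table_height) <= max_density


# 51 cells in an assumed 547 x 150 px table: rejected.
print(round(cell_density(51, 547, 150), 2), passes_density_check(51, 547, 150))
# 12 cells in an assumed 600 x 455 px table: kept.
print(round(cell_density(12, 600, 455), 2), passes_density_check(12, 600, 455))
```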

### Filter 2: Minimum Cell Area
- **Threshold**: Reject tables with average cell area < 3,000 px²
- **Rationale**: Cells in normal tables average 10,000-25,000 px²; cells in the over-detected table average ~1,600 px²
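
A sketch of the area check, assuming each cell is an `(x0, y0, x1, y1)` pixel box. The names are hypothetical, and treating a table with no cells as failing the check is an assumption of this sketch rather than something the proposal states.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

MIN_AVG_CELL_AREA = 3_000.0  # px², threshold from Filter 2


def average_cell_area(cells: List[Box]) -> float:
    """Mean bounding-box area of the cells; 0.0 for an empty list."""
    if not cells:
        return 0.0
    total = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cells)
    return total / len(cells)


def passes_min_area_check(
    cells: List[Box], min_avg_area: float = MIN_AVG_CELL_AREA
) -> bool:
    """True if the table is kept; False if its cells are too small on average."""
    return average_cell_area(cells) >= min_avg_area


# Six 40 x 40 px cells (1,600 px² each), similar to the over-detected table.
tiny_cells = [(i * 40, 0, i * 40 + 40, 40) for i in range(6)]
# Three 150 x 80 px cells (12,000 px² each), similar to a normal table.
normal_cells = [(i * 150, 0, i * 150 + 150, 80) for i in range(3)]

print(passes_min_area_check(tiny_cells))    # False
print(passes_min_area_check(normal_cells))  # True
```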

### Filter 3: Cell Height Validation
- **Threshold**: Reject if (table_height / cell_count) < 10px
- **Rationale**: Each cell row needs minimum height for readable text
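
The threshold translates almost directly into code. The sketch below implements the formula exactly as written, with hypothetical function names; the 300px example table is made up for illustration.

```python
MIN_HEIGHT_PER_CELL = 10.0  # px, threshold from Filter 3


def height_per_cell(table_height: float, cell_count: int) -> float:
    """Table height divided by the total number of detected cells."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    return table_height / cell_count


def passes_height_check(
    table_height: float, cell_count: int, min_px: float = MIN_HEIGHT_PER_CELL
) -> bool:
    """True if the table is kept; False if it has under `min_px` of height per cell."""
    return height_per_cell(table_height, cell_count) >= min_px


print(passes_height_check(150, 51))  # False: about 2.9px per cell
print(passes_height_check(300, 12))  # True: 25px per cell
```

Because the formula divides by the total cell count rather than the row count, a legitimate table with many columns could fall below 10px: 12 cells arranged as 4 rows x 3 columns in a 100px-high table gives about 8.3px even though each row is 25px tall. Dividing by an estimated row count may be worth considering if such tables turn up.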

### Filter 4: Reclassification
- Tables failing validation are reclassified as TEXT elements
- Original text content is preserved
- Reading order is recalculated
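
A sketch of the reclassification step under stated assumptions: the `Element` dataclass, its fields, and the `reclassify` function are hypothetical stand-ins for whatever the OCR service uses, and reading order is approximated by sorting top-to-bottom, then left-to-right, which ignores multi-column layouts.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Element:
    kind: str                                # "table" or "text"
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels
    text: str
    order: int = 0                           # position in reading order


def reclassify(
    elements: List[Element], is_valid_table: Callable[[Element], bool]
) -> List[Element]:
    """Turn tables that fail validation into text elements and renumber reading order."""
    converted = [
        replace(e, kind="text") if e.kind == "table" and not is_valid_table(e) else e
        for e in elements
    ]
    # Simplified reading order: by top edge, then by left edge.
    ordered = sorted(converted, key=lambda e: (e.bbox[1], e.bbox[0]))
    return [replace(e, order=i) for i, e in enumerate(ordered)]


elements = [
    Element("table", (0, 300, 500, 450), "Name: A\nDate: B"),  # key-value text
    Element("text", (0, 0, 500, 50), "Title"),
    Element("table", (0, 100, 500, 250), "real table"),
]
# Stand-in validator; in practice this would combine Filters 1-3.
result = reclassify(elements, is_valid_table=lambda e: e.text == "real table")
for e in result:
    print(e.order, e.kind, repr(e.text))
```

The text field is copied unchanged, so the content of a rejected table survives the conversion.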

## Impact

- Affected specs: `ocr-processing`
- Affected code:
  - `backend/app/services/ocr_service.py` - Add cell validation pipeline
  - `backend/app/services/processing_orchestrator.py` - Integrate validation
  - New file: `backend/app/services/cell_validation_engine.py`

## Success Criteria

1. OCR Track cell count matches Direct Track within 10% tolerance
2. No false-positive tables detected from non-tabular content
3. Table structure maintains logical row/column alignment
4. PDF output quality comparable to Direct Track for documents with tables
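
Criterion 1 can be checked automatically. The sketch below assumes "within 10% tolerance" means a relative difference measured against the Direct Track count; the function name is hypothetical.

```python
def within_tolerance(ocr_count: int, direct_count: int, tolerance: float = 0.10) -> bool:
    """True if the OCR Track count differs from the Direct Track count by at most `tolerance`."""
    if direct_count <= 0:
        raise ValueError("direct_count must be positive")
    return abs(ocr_count - direct_count) / direct_count <= tolerance


print(within_tolerance(105, 12))  # False: current OCR Track result
print(within_tolerance(13, 12))   # True: off by one cell (about 8.3%)
print(within_tolerance(14, 12))   # False: off by two cells (about 16.7%)
```

With only 12 reference cells, a 10% tolerance allows a difference of at most one cell on the sample document.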