egg/OCR

egg 940a406dce chore: backup before code cleanup

Backup commit before executing remove-unused-code proposal.
This includes all pending changes and new features.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2025-12-11 11:55:39 +08:00

Change: Fix OCR Track Cell Over-Detection

Why

PP-StructureV3 is over-detecting table cells in OCR Track processing, incorrectly identifying regular text content (key-value pairs, bullet points, form labels) as table cells. This results in:

4 tables detected instead of 1 on the sample document
105 cells detected instead of the expected 12
Broken text layout and incorrect font sizing in PDF output
Poor document reconstruction quality compared to Direct Track

Evidence from task comparison:

Direct Track (cfd996d9): 1 table, 12 cells - correct representation
OCR Track (62de32e0): 4 tables, 105 cells - severe over-detection

What Changes

Add post-detection cell validation pipeline to filter false-positive cells
Implement table structure validation using geometric patterns
Add text density analysis to distinguish tables from key-value text
Apply stricter confidence thresholds for cell detection
Add cell clustering algorithm to identify isolated false-positive cells

Root Cause Analysis

PP-StructureV3's cell detection models over-detect cells in structured text regions. Analysis of page 1:

Table	Cells	Density (cells per 10,000 px²)	Avg Cell Area	Status
1	13	0.87	11,550 px²	Normal
2	12	0.44	22,754 px²	Normal
3	51	6.22	1,609 px²	Over-detected
4	29	0.94	10,629 px²	Normal

Table 3 anomalies:

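Cell density is roughly 7-14x that of the normal tables (6.22 vs 0.44-0.94)
Average cell area is only about 7-15% of the normal tables' (1,609 px² vs 10,629-22,754 px²)
A table height of 150 px shared by 51 cells leaves about 3 px per cell (150 / 51 ≈ 2.9), too little for a row of readable text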
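
The anomaly ratios can be re-derived from the table. The following is a minimal sketch that uses only the page 1 figures; the names are illustrative, and it assumes density is the reciprocal of average cell area, which matches the reported densities to two decimal places:

```python
# Figures copied from the page 1 table: table id -> (cell count, average cell area in px²).
tables = {1: (13, 11_550), 2: (12, 22_754), 3: (51, 1_609), 4: (29, 10_629)}


def density(avg_cell_area: float) -> float:
    """Cells per 10,000 px², assuming density = 10,000 / average cell area."""
    return 10_000 / avg_cell_area


suspect_area = tables[3][1]
for table_id, (_, avg_area) in tables.items():
    if table_id == 3:
        continue
    density_ratio = density(suspect_area) / density(avg_area)
    area_share = suspect_area / avg_area
    print(f"Table 3 vs table {table_id}: "
          f"{density_ratio:.1f}x density, {area_share:.0%} of the cell area")

# Height available per cell if all 51 cells share the 150 px table height.
print(f"{150 / 51:.1f} px per cell")
```

The per-table ratios come out slightly differently depending on which normal table is used as the baseline, so the ranges quoted above are best read as approximate.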

Proposed Solution: Post-Detection Cell Validation

Apply metric-based filtering after PP-Structure detection:

Filter 1: Cell Density Check

Threshold: Reject tables with density > 3.0 cells/10000px²
Rationale: Normal tables on the sample page have a density of 0.4-1.0; the over-detected table has 6+
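
One possible shape for this check, as a sketch only: `DetectedTable` is a hypothetical container (not a PP-StructureV3 type), and the 547 px width in the usage lines is an assumed value chosen so that the area roughly matches the density reported for Table 3.

```python
from dataclasses import dataclass

MAX_CELL_DENSITY = 3.0  # cells per 10,000 px², the Filter 1 threshold


@dataclass
class DetectedTable:
    """Hypothetical summary of one detected table region."""
    width: float
    height: float
    cell_count: int


def cell_density(table: DetectedTable) -> float:
    """Return cells per 10,000 px² of table area."""
    area = table.width * table.height
    if area <= 0:
        # A degenerate region cannot hold a valid table; treat it as infinitely dense.
        return float("inf")
    return table.cell_count / area * 10_000


def passes_density_check(table: DetectedTable) -> bool:
    return cell_density(table) <= MAX_CELL_DENSITY


# Table 3 from the analysis: 51 cells in a 150 px tall region (width assumed).
over_detected = DetectedTable(width=547, height=150, cell_count=51)
print(passes_density_check(over_detected))
```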

Filter 2: Minimum Cell Area

Threshold: Reject tables with average cell area < 3,000 px²
Rationale: Normal tables average 10,000-25,000 px² per cell; the over-detected table averages ~1,600 px²
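
A minimal sketch of the area check, assuming cells arrive as `(x0, y0, x1, y1)` pixel boxes; that box format and the function names are assumptions, not the detector's documented output.

```python
from statistics import mean

MIN_AVG_CELL_AREA = 3_000.0  # px², the Filter 2 threshold

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed format


def box_area(box: Box) -> float:
    x0, y0, x1, y1 = box
    # Clamp at zero so inverted boxes do not produce negative areas.
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def average_cell_area(cells: list[Box]) -> float:
    """Mean area of the cell boxes; 0.0 when no cells were detected."""
    if not cells:
        return 0.0
    return mean(box_area(cell) for cell in cells)


def passes_min_area_check(cells: list[Box]) -> bool:
    return average_cell_area(cells) >= MIN_AVG_CELL_AREA
```

In the page 1 figures, average cell area multiplied by density is almost exactly 10,000, which suggests the reported averages were computed as table area divided by cell count. If so, Filter 1 (density above 3.0, that is, under about 3,333 px² per cell) already rejects every table that Filter 2 would reject; averaging the individual cell boxes, as in this sketch, makes the two checks independent.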

Filter 3: Cell Height Validation

Threshold: Reject if (table_height / cell_count) < 10px
Rationale: Each cell row needs a minimum height to hold readable text
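
The height check reduces to a single division. A sketch with hypothetical function names, applied to the Table 3 figures (150 px, 51 cells):

```python
MIN_HEIGHT_PER_CELL = 10.0  # px, the Filter 3 threshold


def height_per_cell(table_height: float, cell_count: int) -> float:
    """Table height divided by the number of detected cells."""
    if cell_count <= 0:
        raise ValueError("cell_count must be positive")
    return table_height / cell_count


def passes_height_check(table_height: float, cell_count: int) -> bool:
    return height_per_cell(table_height, cell_count) >= MIN_HEIGHT_PER_CELL


print(passes_height_check(150, 51))  # Table 3 from the analysis
```

Because the ratio divides by the total cell count rather than the row count, it is a conservative proxy: a wide table with many columns scores lower than its true row height.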

Filter 4: Reclassification

Tables failing validation are reclassified as TEXT elements
Original text content is preserved
Reading order is recalculated
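
The reclassification step could look roughly like the sketch below. The `Element` record, the `"TABLE"`/`"TEXT"` labels, and the top-to-bottom, left-to-right reading order are all assumptions made for illustration; the actual element model and ordering rule are not specified here.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Element:
    """Hypothetical layout element."""
    kind: str                                 # "TABLE" or "TEXT"
    bbox: tuple[float, float, float, float]   # (x0, y0, x1, y1)
    text: str
    order: int = 0


def reclassify_invalid_tables(
    elements: list[Element],
    is_valid_table: Callable[[Element], bool],
) -> list[Element]:
    """Turn tables that fail validation into TEXT, then renumber reading order."""
    updated = []
    for element in elements:
        if element.kind == "TABLE" and not is_valid_table(element):
            # Only the label changes; the original text content is preserved.
            element = replace(element, kind="TEXT")
        updated.append(element)
    # Assumed reading order: sort by top edge, then by left edge.
    updated.sort(key=lambda e: (e.bbox[1], e.bbox[0]))
    return [replace(e, order=i) for i, e in enumerate(updated)]
```

Here `is_valid_table` stands in for whatever combination of Filters 1-3 the validation pipeline applies.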

Impact

Affected specs: ocr-processing
Affected code:
- backend/app/services/ocr_service.py - Add cell validation pipeline
- backend/app/services/processing_orchestrator.py - Integrate validation
- New file: backend/app/services/cell_validation_engine.py

Success Criteria

OCR Track cell count matches Direct Track within 10% tolerance
No false-positive tables detected from non-tabular content
Table structure maintains logical row/column alignment
PDF output quality comparable to Direct Track for documents with tables
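
The first criterion can be turned into an automated check. This sketch assumes the 10% tolerance is relative to the Direct Track count, which the criterion does not state explicitly; the function name is illustrative.

```python
def within_tolerance(ocr_cells: int, direct_cells: int, tolerance: float = 0.10) -> bool:
    """True if the OCR Track cell count is within `tolerance` of the Direct Track count."""
    if direct_cells == 0:
        # With no reference cells, only an exact match of zero counts as passing.
        return ocr_cells == 0
    return abs(ocr_cells - direct_cells) / direct_cells <= tolerance


# Current state from the task comparison: 105 OCR Track cells vs 12 Direct Track cells.
print(within_tolerance(105, 12))
```

With a reference count of 12, this definition allows a difference of at most one cell in either direction.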
